>> Xiaojin Zhu: It's my great pleasure to introduce Burr Settles. Burr graduated with a PhD
from the Wisconsin Computer Science department; then he did a postdoc with Tom Mitchell at
Carnegie Mellon; now his most well-known book is a survey book on active learning, and
he is also very famous for writing songs and performing at actual events. Okay. So today he's
going to tell us about interactive machine learning.
>> Burr Settles: Thanks Jerry. It's a pleasure to be here. I think I'm probably more famous for
active learning than for my band although sometimes I wish it were the other way around. So
it's a pleasure to be here. It's been fun to talk to a lot of the faces that are here today and learn
about what you guys have been doing. So a lot of this talk is definitely preaching to the choir
and it sounds like a lot of things that you already know, but bear with me.
Hopefully it's not still boring to you now, but I'll start out by talking about the typical passive
machine learning paradigm. We’ve got these three elements. There's a human expert who
knows something about the problem and a machine learning system that you want to train to
automatically solve the problem, and then you've got the data, which typically starts out as
unlabeled data, and there's a bit of a bottleneck problem in that there's a lot of it. If
there weren't a lot of it we probably wouldn't want to be training machine learning systems to
solve the problems for us. So what got me interested in this is when I was a graduate student,
let me switch over here, one of my first projects was this information extraction system for
biomedical texts. So this is a little GUI demo which amazingly still works after a decade of not
touching the code.
So the task here is given a biomedical journal article it will automatically extract things like
genes and the proteins that are coded for by those genes and cell lines and cell types and things
so these are very related to problems that you guys are working on: extracting names and
addresses and phone numbers, or people, organizations, and locations out of newswire text.
However, this was a less popular and less studied problem than those other things and so there
was a paucity of data for them so the labeling bottleneck was very real and that got me
interested in how to kind of optimize that and make better use of the human’s time.
So if we were to assign a grade to this paradigm, as if it were a machine learning student or
something, then you'd probably give this a C plus, but since we're computer scientists maybe a
C plus plus. I spoiled my own joke. So the question is can we do better? And I take inspiration
from this quote: Computers are useless. They can only give you answers. Maybe we can make
them more useful if they'll ask questions as well. I don't know if Pablo Picasso actually said this,
but the Internet says so; I can't find any definitive evidence that he said it.
So this informed the active learning paradigm where now we’ve got this cycle rather than just
humans selecting some amount of the unlabeled data to label and handing it off to a machine
learning algorithm. Now we can label some small subset, give it to a machine learning
algorithm, then that algorithm can inspect the swaths of unlabeled data and hopefully in a
targeted way make better use of the human’s time.
So here was an interesting, maybe negative, result from my thesis work where, I don't know if
I'm supposed to stand in front of the light, so ignore this chart for now. This is a typical
learning curve where you have the number of instances in an information extraction problem
and what we would like to see is what you're seeing here that with less labeled data we can get
higher accuracies or in this case F1 measure in the active case versus the passive case.
So this was for a real user experiment that I did on this biomedical named entity recognition
task. So this is what we hoped to see and it’s pretty typical. However, since it was a real
experiment it was actually logging the amount of time that the human spent on the tasks. It
was interesting to compare actual wall clock time. So this is the exact same learning curve but
where the cost function is different it's actually time and not necessarily the number of
instances. It turns out the instances that were hard for the machine to label, and so it wanted
more information about, turned out to also be hard for the humans to label and so the actual
gains kind of washed out because we weren’t taking the right cost function into consideration.
So this inspired sort of a variant of active learning that I was calling interactive learning.
Interactive learning is maybe too broad and it can be interpreted a lot of different ways. So like
machine teaching or machine coaching or something like that is maybe more along the right
lines. But the idea is to treat this process more like a conversation between the human experts
and the learning algorithms. So we can let the human show some initiative by offering some
advice about the problem. So we go into the problem presumably with some domain
knowledge, maybe it's not perfect but we have an idea of what characterizes the problem, and
let the machine show some initiative by a few different things. So one is asking questions or
active learning, and another is attempting to teach itself from all of the unlabeled data that
it has available and then maybe that can inform the questions that it asks and the types of
domain knowledge that it tries to elicit from us. So this is a very mixed initiative paradigm in
that both the humans and the machines have some agency and both have multiple ways of
interacting with each other. So I'll make this a little bit more concrete throughout the rest of
the talk.
But what this might look like in this interactive machine learning or machine coaching sort of
scenario is, before any labeling has been done, just tabula rasa: the human expert has an idea.
So this is an information extraction task of extracting BibTeX records out of actual citations in a
document. So you think the word proceedings is a good predictor of the book title field. So
this is a rule that I can just come up with. Another one might take a slightly different form, I
don't know why it's not showing up on this screen, there we go. Another one is I can maybe
define a regular expression of four digits in a row. That seems to be a predictor of dates. It
could also be things like page numbers. So the advice doesn't even need to be perfect but it
needs to be kind of pointed in the right direction.
So the language of advice we could give, here's a repeating sequence of initials that is probably
indicative of the author field, and then I can come up with these ideas and then hand them off
to the machine learning algorithm. Then it can take this, so far no instances have even been
labeled per se, and the algorithm can inspect large amounts of data and then through some
semi-supervised learning try to explain what it sees down here given those rules as best as it
can, and then ask questions like there's this word conference that appears in the same
neighborhood as proceedings. I think that might be the book title field or can you label this
feature for me as well? Or maybe everything we've seen so far has been conference
proceedings. I don't know what a journal is. Maybe you should annotate this journal citation
for me. And then the human can turn around, answer those queries given to the machine
learning algorithm, and this whole process can repeat. Ideally the human can even come up
with more rules or advice domain knowledge at any time and inject those into the system.
>>: You have no labels here? At what point did you inject labels?
>> Burr Settles: At this point. When the algorithm asked for me to label this citation.
>>: So there's a few [inaudible] the instance of that field is the segment? I still do not
understand what the task was. Sorry.
>> Burr Settles: So the task is given a citation, segment that citation into the title field, the
author field, the year, the publication venue, so on and so forth.
>>: So I guess I'm confused by at which part do you specify the features and which part do you
specify the fields.
>> Burr Settles: So my thinking and the sorts of ways that I have been approaching these
problems is a little bit different than it sounds like with the ICE system how you guys have been
approaching things. So you can think of a bag of words of features and all the features just
already exist so there's no feature selection or feature generation.
>>: [inaudible]?
>> Burr Settles: Right. But we can label the features themselves and I'll talk about the machine
learning algorithms for doing that in a few different scenarios. So part of the mixed initiative
aspect here is that we can not only label instances, which in this case means segmenting and
chopping up the citation into the appropriate fields, but you can also label features and say
when I see this word or when I see this capitalization pattern or something those are predictive
of these labels.
>>: [inaudible] segments [inaudible].
>> Burr Settles: Right. So it's kind of a domain knowledge rule that you can provide without
even being linked to any specific example.
>>: And the constraint there is that this word will only appear in the positive?
>> Burr Settles: Not necessarily. So the rules could be imperfect. There could be noise in
them, but hopefully the semi-supervised step would help it to sort of tease out the context
under which this rule really applies versus when it doesn't apply.
>>: [inaudible]?
>> Burr Settles: So in the case of conditional random fields to do this extraction you can think
of the algorithm that we use for incorporating this domain knowledge as a regularization term.
And in a few minutes I'll actually go into the details of how that works.
So there are a couple of different research papers that I would say fall into this paradigm, but
before diving into those I would argue that this plays to our complementary strengths. So
machines are relatively inexpensive in terms of their time, they're high bandwidth, they don't
complain about going through hundreds of thousands or millions of examples, whereas at least I
would, but as far as knowledge and intuition about the problem machines are kind of dumb.
They’re brute force but if they already knew how to solve the problem we wouldn’t need to
teach them. However, actually that's where we have our strength. So the active learning helps
to reduce the expense of the time spent on the task. High-bandwidth semi-supervised
learning plays to the strength of the machine to be able to crunch through lots and lots of
unlabeled data, and the advice giving part of the dialogue helps us to take advantage of our
knowledge and intuition about the problems. So if we can build a paradigm that incorporates
all of these things then that paradigm is maybe an A plus as opposed to a C plus plus paradigm.
So I'll talk about these three different research projects over the years. Active feature labeling,
a system that I built a few years ago called DUALIST, and then briefly if there's time at the end
I'll talk about a collaboration, actually with Jerry, looking at DUALIST in kind of a crowd-
sourcing setting. The first two are focused mainly on user interfaces and modifications to
machine learning algorithms to facilitate this kind of dialogue; and the third section is kind of
taking the camera and pointing it back at the humans to understand how human behavior in
this kind of system affects the ultimate accuracy of the systems.
So this was an EMNLP paper in 2009 with Andrew McCallum and Gregory Druck, who at that
time was at University of Massachusetts Amherst. So this is a sequence extraction problem. So
going back to the citation extraction we can think of the conditional random field as a
probabilistic finite state machine. So if we have this citation here the extraction corresponds to
the most probable path through this model where the states correspond to the different labels.
So M.J. and Schervish are both self-transitions on the author state, then Theory of Statistics,
Springer, 1995, and so on through the title, publisher, and date states, modulo punctuation.
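As a concrete illustration, the most probable path for that citation would assign something like the following (my own rendering with guessed label names, not the exact tag set from the paper):

```python
# Hypothetical token-to-state alignment for the running example citation.
path = [
    ("M.J.", "AUTHOR"), ("Schervish.", "AUTHOR"),   # self-transition on the author state
    ("Theory", "TITLE"), ("of", "TITLE"), ("Statistics.", "TITLE"),
    ("Springer,", "PUBLISHER"),
    ("1995.", "DATE"),
]
```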
So this is the underlying model that we are working on but I'm not going to get into too many
details of what CRFs look like under the hood. But to motivate this feature labeling idea, so
features in this case are the input variables used by the machine learner, and a feature label is a
rule, just a simple if-then rule, that can express the human's knowledge. So, for example, these
are examples I gave before. If the word is proceedings that implies the book title label, if it's
four consecutive digits, it implies a date, if it's a repeating sequence of initials maybe that
implies author or editor. The way we want to incorporate this information into the
conditional random field, or more generally, is that any feature implies some probability distribution
over labels given that that feature fires for the instance or that token. And so here are some
reference probability distributions. There's a lot of different ways you can formalize this, but
let's say this rule implies that every time you see the word proceedings it's a book title or every
time you see the digits it's a date, and maybe half the time it's an author versus an editor. It
turns out you can make the exact probability distribution more and more accurate, but it
doesn't seem to matter all that much.
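Written down, the reference distributions behind those three rules might look something like this (illustrative feature names and values only, not the exact numbers used in the work):

```python
# Reference label distributions for a few labeled features.
reference = {
    "word=proceedings":  {"BOOKTITLE": 1.0},
    "regex=\\d{4}":      {"DATE": 1.0},
    "repeated-initials": {"AUTHOR": 0.5, "EDITOR": 0.5},
}
```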
>>: I have a question. So imagine I use a bag of words [inaudible] and then the words that the
teacher provides use a different regularizer because the problem is different. The teacher has
more knowledge that this one is important so it needs to be regularized differently. Now with
these two regularizers the correlation between the book title and proceeding [inaudible] is
going to be learned with the new regularizer. Wouldn't that have exactly the same effect? And
it's very easy to formalize if you formalize it this way. You have the bag of words [inaudible]
regularizer and what's suggested, and you don't need to say, oh, proceedings is positive, because I
have data, I have labels. So it will infer that proceedings is positive; I don't need to specify it,
because I have the data. I can infer it from the data. I just need to regularize it
differently because the prior is different.
>> Burr Settles: So I guess I'm not sure I understand the question. So in this case you can
actually do this learning with absolutely no labeled data in the sense that-
>>: That is the difference I guess. But if I have just even a little bit of labels, all these numbers,
how correlated this proceedings [inaudible] is, that will be very quickly inferred, right?
>> Burr Settles: I guess it depends on how frequently that label, that feature in that context
being labeled, actually shows up in the corpus. But probably, when you're sitting down, no work has
been done yet. As a human, when you sit down to think of these rules you're probably going to think
of the highest-coverage, most salient, most predictive, most correlated sorts of features, at
least at the beginning.
>>: [inaudible] if you use bag of words you're going to need a few labels anyway. So you can
have a few labels. The words you're going to put in all have enough data, so I'm just wondering, if I
put a prior I have to think about whether it's 75 percent of the time predictive of this, but now I have
to think 75, is it 80, is it 90. If I just regularize it I'll get the learning to do that.
>> Burr Settles: Right. And that's kind of what's happening here. You don't have to think too
hard about whether proceedings is 90 or 80 percent of the time. This probability distribution doesn't
have to be very precise because of how, and actually on the next slide I'll show how we incorporate
these rules, so this is the learning problem. This is the objective function. So we want to
minimize the KL divergence between these reference probability distributions and the model. So this is
the feature label distribution that is generated by the fact that you labeled this particular
feature. So this could be 100 percent, it could be 50 percent, it could be 80 percent; it doesn't
matter as long as it's kind of pointed in the right direction. And then this is the expected model
distribution. So given the current parameters of the model, I'm going to scan through all of the
unlabeled data, and 90 percent of the time when I see this feature I think it is this label. I want
to get these two distributions to be as similar as possible, so we repeat this for all of the rules
that are available, subject to some regularization term to prevent over-fitting. It's a little bit
complicated, but it turns out you can optimize this using gradient descent with L-BFGS. You
might even be able to do this in a stochastic gradient descent setting, but we were doing batch
here.
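In symbols, a reconstruction of that objective (from the description above, not copied from the paper) is roughly

$$\min_\theta \;\; \sum_{k \in \text{labeled features}} \mathrm{KL}\Big( \hat{p}_k(y) \,\Big\|\, \tilde{p}_\theta(y \mid f_k) \Big) \;+\; \sum_j \frac{\theta_j^2}{2\sigma^2},$$

where $\hat{p}_k$ is the human-provided reference distribution for feature $f_k$, $\tilde{p}_\theta(y \mid f_k)$ is the model's predicted label distribution averaged over the unlabeled tokens where $f_k$ fires, and the second term is the usual Gaussian regularizer on the CRF weights.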
>>: So you could have the constraint that four digits means a date even though there were no
four-digit dates in your [inaudible] and this would force it to-
>>: So [inaudible] zero labels.
>> Burr Settles: This works with zero labeled instances. It could also be that the rule is bad in
the sense that there are many four-digit tokens that are part of the page field of the ontology or
whatever, but there could be other contextual information, either through the rules or through the
annotated instances; it's okay if we disagree some of the time with these reference
distributions because we're just trying to minimize the disagreement.
>>: So that term, the left one, is actually fixed, so it's fixed and you update the theta and try to
approximate that.
>> Burr Settles: Right. You can also, if you have labeled instances, then that's another term;
your typical objective function for training a CRF is just another term you can add to this. You
could add weighting terms so that the labeled instances are more important than the regularizers.
>>: [inaudible] do that is basically because actually the right term actually [inaudible] so that’s
the prior you need to ask [inaudible]. So this is basically kind of regularization. We try to have
better weight [inaudible] unlabeled data at the same time try to make the probability of
predicting a label given the same feature-
>> Burr Settles: And match the expectations that the human provides. Did you have another
question?
>>: I think you answered it. Basically this can act as a prior, and if you have labeled data it's
going to override it if you have enough labeled data, but that is not shown on this slide, the
labeled data.
>> Burr Settles: But you could, and in this particular research we didn't actually have any
labeled data, we were only experimenting with labeled features at this particular point. But for
engineering reasons we could only sort of do one or the other at the time, but in theory, and
actually since then, we've been able to train models that are mixed in terms of what types of
labels are around.
>>: How does this actually work for the sparse feature values? So the constraint is like the date: if it's
got four digits I think it's a date; if it doesn't have four digits I have no guess. So in this
conditional distribution, when you're conditioning, is the sum only over things where the feature
is present, or is it-
>> Burr Settles: So the sum is over the feature rule, and this term is over the expected
distribution of all of the tokens where that feature fires. However you can, let me think of how
to, so it's possible, for example, that four digits that are dates tend to show up in certain
contexts. This regularization term which is preventing over-fitting will start to push the
parameter mass to things like what the previous token was and what the next token is as
long as the feature space is rich enough to include those kinds of features. So it's not just
memorizing that four digits is a date. This term is causing the parameter mass to sort of bleed
into correlated features so you might be able to extract, even without any other labeled data,
other types of dates as long as they appear in the same contexts. Does that make sense?
>>: So why was [inaudible] you needed labels otherwise you would not benefit from [inaudible]
but that was incorrect, because you will have nonzero weight on the bag of [inaudible] when
you have zero labels, because it's going to diffuse. So this is actually very cool.
>>: It seems like you have a separate set of features and they’re independent in some way or
separate. So to get this to work does it depend on the set of features that you have so that it
would actually work within the scenario you're talking about? Like, it has to have a rich enough set of
features; if I only had a date and a conference title and I didn't have anything like previous or
next words, then it just seems like it would give the date and the conference title and wouldn't give
anything else. So can you comment about the features you need to get this to actually work?
>> Burr Settles: Yes. That's a good observation. It's not entirely true in this case because the
CRF there's also some sequential information, so some of the information might bleed through
the Markov transition probabilities that are implicit in the model.
>>: [inaudible]?
>> Burr Settles: Not explicitly. You could also even think about generalizing this to say that I
think when I see this label 90 percent of the time the next token is this label. Those are
probably harder for people to come up with but those are constraints that you could throw into
this sort of model. But I think it is true that having some local neighborhood features, like just a
very large sparse representation, throwing in the kitchen sink, which is what we were doing at
least five years ago, throwing the kitchen sink of features into these models and letting
the statistics sort of wash out, so having like previous token and next token really is what
enables you to make inferences about things where the feature doesn't fire. Does that make
sense?
>>: You've got to have those features otherwise it won't do what you said about spreading the
mass away from conference [inaudible].
>> Burr Settles: That's a good point. So to make this a little bit more concrete here's the user
interface that we built. It's a little clunky but it's research grade. So what we have over here on
the left is a collection of features that are being queried by the system. These are organized
kind of by distributional similarity, so these are different features
that we think are related in some way according to the model. So you can select a particular
feature, like the word is Kaufmann, and then over on the right-hand side you see a bunch of
examples of that feature being used in context highlighted with a color of the corresponding
label that the model thinks it should have at this time in the training process. So in all of these
cases it thinks that Kaufmann is part of a publisher field and then up at the top you see the
expected distribution. So every time I see the word Kaufmann 95 percent of those cases I think
it's a publisher and then one percent I think it belongs to an author which kind of makes sense,
or maybe an editor field.
And going back to the previous slide, so this is what the reference distribution looks like which
is maybe imperfect, this is what the model’s current expected distribution looks like and the
regularization term is just trying to get these two pie charts to look as similar as possible.
>>: So does the system give you that word list and you as a user you choose from that and
label it? This is in some ways kind of like active learning, active feature labeling, right?
>> Burr Settles: It is in fact active feature labeling, and the next question is how do we choose
which features to present, because we are exhaustively generating all possible features. We
have an algorithm for generating the features and there's millions of them maybe. How do we
pick the one hundred that we are going to show here?
>>: So you didn't do a case where a human teacher had typed in a feature and said, I want
to label that.
>> Burr Settles: Right. So for this work we don't even exercise that case, but the next section I
talk about an approach where you can do that.
>>: So that KL distance, this is on a token level prediction?
>> Burr Settles: Yes. It's a conditional random field, so a label assignment corresponds to a
state assignment in a path of the token sequence through the finite state machine.
>>: So can your Kaufmann feature inform Morgan being a publisher?
>> Burr Settles: If you have the feature. Maybe we don't have it here, but you could have the
feature the previous word is Morgan and the regularization term will force this to not just
memorize that Kaufmann means publisher, but a correlated feature with that is that the
previous word was Morgan and the transition probabilities from publisher to publisher are
perhaps higher than they are for other things so it kind of implicitly learns that Morgan might
have that label as well.
>>: I see. So with that effect in the modeling can you restrict yourself to unigrams and you get
away with most [inaudible]? So let's say there's a lot of Kaufmann that's not a publisher but
you leave the Morgan in there.
>> Burr Settles: I see what you're saying. Currently, like I said, one percent of the time it thinks
that Kaufmann is an author so there's clearly something else that's informing those decisions. I
can't speak to exactly what it is, but it could be that the previous word being Morgan is one of
the things it's using as a predictor for the 95 percent of the time that Kaufmann is a publisher. This
particular sampling of things doesn't show any predictions other than publisher so I don't really
know. Now, given some of the things that I saw with the demo of ICE, one thing that an
interface like this could benefit from is: okay, I'm going to select Kaufmann and author, so show
me all the cases where we think it's author, and that might really inform me whether there are false
positives, or some cases will show me more like this or something like that. So that's another
whole mode of interaction that isn't accounted for in this particular paradigm.
Okay. So there's the question of how do we choose the right question? Ideally we would
like some complex but well-justified decision-theoretic approach to select features to maximize
some expected utility: if we got this feature labeled with all ten possible labels, how much would it
minimize our uncertainty? But that's too expensive to do. So instead we use a more efficient
approximation, which is just the expected uncertainty of the label distribution
over all the tokens where this feature fires, weighted by some term for how frequently it fires.
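A rough sketch of that scoring heuristic as I understand it from the description (the function and variable names are mine, not from the paper):

```python
import math

def entropy(dist):
    """Shannon entropy of a label distribution given as {label: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def feature_score(posteriors, frequency):
    """Expected uncertainty over the tokens where a feature fires, weighted by log frequency.

    posteriors: the model's predicted label distributions for those tokens.
    frequency: how many times the feature fires in the unlabeled corpus.
    """
    mean_uncertainty = sum(entropy(p) for p in posteriors) / max(len(posteriors), 1)
    return mean_uncertainty * math.log(frequency + 1)

# Rank all candidate features by this score and present the top 100 as queries.
```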
So we want sort of an exploration/exploitation kind of thing. We want to explore things that
we're uncertain about but also restrict ourselves to things that are likely to have impact
because they occur frequently. So this is a heuristic that we tried. Here are results from a
simulation study with two benchmark information extraction data sets. So one was the-
>>: What is the question? Is it this a part of the feature?
>> Burr Settles: The question is here is a feature, label it for me.
>>: So you're trying to solicit the four-digit regular expression and you have some corpus of
hypotheses, your bag of words, and you want to surface some of the candidates one by one,
right?
>> Burr Settles: Or in this case we are actually showing a bunch of them. So the form of the
query is I want to pick which 100 words to show in this interface over here. Then when the
user clicks on one of those features I can see what the model currently thinks about that
feature and then I can provide some feedback like directed feedback based on my domain
knowledge that this feature implies this label.
>>: [inaudible] Kaufmann you might want to say this is a publisher.
>> Burr Settles: Yeah. So there are these buttons at the top. So I could click on
Kaufmann and then click on publisher. And you can see some of these, like eds implies editor, data
implies title, programs, parallel, language, I guess they're all titles.
>>: Those are human labeled?
>> Burr Settles: Yeah. So the things that are in parentheses that are colored those are human
provided labels in this particular batch of 100 feature queries.
>>: So in theory you could have this \d{4} as another entry [inaudible] and the human can
say this would be here.
>> Burr Settles: So there's a couple different ways from an engineering perspective you could
do that. One is before any annotation is even done you could give the feature generator
factory or sub process a language to create these regular expressions. Another one is if we had
the facility to inject new feature labels we could just write the regular expression on the fly and
then have it retroactively apply to everything.
>>: So we can flip the grouping here? The whole group of five?
>> Burr Settles: Yeah. So this is a little less intuitive than I would like, but basically out of the
100 features there we run a quick clustering algorithm and kind of these are different clusters
of things that show up according to the features that co-occur with those features, so the idea
being these features tend to show up in the same contexts so if we present them together it
will reduce the cognitive load. They probably have the same label and so I can just kind of go
down the list within a particular bucket and hopefully they all have the same label and I can go
through them quickly.
>>: [inaudible] label one feature at a time?
>> Burr Settles: For this particular interface.
>>: Why would they have the same label? Because in some ways they're just like Morgan,
Kaufmann, 1992. 1992 doesn't have the same label as Kaufmann.
>> Burr Settles: Right. So those two things probably wouldn't show up in the same cluster
though. But maybe, I'm trying to think of another publisher name, maybe the word press
would show up.
>>: [inaudible] Springer [inaudible].
>> Burr Settles: So it's an attempt without actually knowing for the machine to organize these
things. We didn't do too much research and engineering into the HCI side of this. They were
kind of based on intuitions but we didn't try many controlled experiments.
So this is the heuristic we are going to use. Here are some experiments. We had two data sets.
One was the research citation running example; another was Craigslist apartment classifieds,
trying to extract location, contact information, rent, utilities, etc. We compared combinations
of active versus passive learning and feature labeling versus fully labeling instances, and we
evaluated the learner after 10 minutes using a simulated annotator; the simulator
was something that had already been used in previous research on feature labeling
with conditional random fields, so the details are in that paper.
Here are the results. So after 10 minutes, this is the token-level accuracy of active learning,
where the features were selected, versus this sort of passive heuristic. We used Latent Dirichlet
Allocation to sort of cluster the features and we picked the most frequent features out of each
cluster, so it's another hybrid: it's passive-ish but active-ish, but it was a previous approach
that had been used in these kinds of interfaces. Then this is active learning by selecting
instances using uncertainty sampling versus passive learning, so a random selection of instances
and fully labeling instances. And the cost functions here, I don't have them on the slide, I don't
remember exactly what they were, but empirically in some user studies we found that it took
like 2 to 4 seconds to label a feature versus 10 to 20 seconds to label an instance. So that's how
we're estimating the time after 10 minutes of annotation.
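As a rough back-of-the-envelope check (my own arithmetic using the midpoints of those estimates, not figures from the paper): 10 minutes at about 3 seconds per feature is roughly 200 feature labels, versus roughly 40 instance labels at about 15 seconds each.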
>>: Just to be clear, instance label that means you actually have a [inaudible]?
>> Burr Settles: The whole thing. It was a fairly efficient interface where you clicked and
dragged and the whole token is highlighted.
>>: Does it update every time? So any label doesn't refresh with a new set of features?
>> Burr Settles: No. So you do all of the annotations and then click learn and then it retrains
that comes up with a new set of features.
>>: So in your simulations how many do they label in each batch?
>> Burr Settles: So in these simulations, it's been long enough that I don't remember the details, I
think we kind of assumed that maybe there was an instantaneous refresh or something in
estimating this cost function, which in reality, if anything, is probably giving an
advantage to the instance labeling as opposed to the feature labeling.
>>: How do you select which feature to-
>> Burr Settles: Using the heuristic, the expected uncertainty times the log frequency.
>>: So out of that list of 100 you use that to select [inaudible]?
>> Burr Settles: Right. So we take the top 100 and then there's the oracle, the simulated oracle,
that says okay, out of these 100 these are the ones that I want to label. So this experiment and
the next one, so the results are even more stark for the Craigslist classifieds. In some sense you
can't really trust these because they're simulated experiments anyway, and one does not
simply simulate annotators. So again, going back to if the cost function is the number of
instances maybe you get some great gains but we want to see what this looks like in a real
setting. So we ran a couple of actual user studies using that interface. So here are the results
for one user annotating the research citation data set. For this we only did two conditions, so
active learning for labeling full instances which is the green line versus active learning labeling
the features, and here we see in 10 minutes of work we get significantly more accurate
extractors using the active feature labeling paradigm and the same for a different annotator on
the same data set.
>>: So the features have been like previously defined, a set of features?
>> Burr Settles: You can think of it as like a bag of words. Then there are some context
features and capitalization features, but yeah. The vocabulary of the feature space has already
been generated in these experiments.
>>: Is this starting from zero examples or did you do any study from some number?
>> Burr Settles: So if you notice these learning curves start at two. That's because there was
already two minutes’ worth of work done annotating just the most frequently occurring
features. That's how we started. Or maybe we started with the LDA approach for the first
batch.
>>: So if you already had like two thousand examples do you think you would be able to label
the remaining features in a meaningful way? Is it because you're starting off from tabula rasa
that you can come up with things?
>> Burr Settles: So my intuition, and the next section I have some evidence for this, is like you
get more mileage out of the feature labeling early on. So if you already have 1000 labeled
instances you may not get as much out of the, actually I'm not sure that's true. But I think this
initial spike that we are seeing in both of these cases partially comes from the fact that we
didn't have any information at all and we were providing these very targeted, low-hanging-fruit
kind of domain knowledge about the feature-label associations.
>>: How do you choose [inaudible]?
>> Burr Settles: I don't know. No more than 1000 because, no more than 500. It's maybe 100.
>>: So it's one per second?
>> Burr Settles: It takes about 2 to 4 seconds. So another part of the reason I'm hesitating a
little bit, what?
>>: That's very efficient.
>> Burr Settles: Yeah. The feature labeling in both this paradigm and the next section turn out
to be very efficient.
>>: But do they spend a lot of time like deciding which ones to [inaudible]?
>> Burr Settles: This is the actual time that they spent. This isn't a simulation. So I don't know
how much, I don't think we went back and looked at how much time they spent doing one thing
or the other-
>>: [inaudible] second?
>> Burr Settles: I think it was 2 to 4 depending on the data set. One of them was two seconds
per annotation: when you take the number of annotations that were done divided by the amount
of time that was spent, it turned out to be two seconds I think for this data set and four seconds
for the other data set; it was two and four for one or the other.
>>: So there's a UI question, right? So in this [inaudible] show when a user is labeling features
then do I run other things to label?
>> Burr Settles: That shows my 100. And then you kind of have to click retrain and wait for the
next batch.
>>: So whatever I labeled would not appear here again because those are features that already
received [inaudible]? A fresh grid of 100.
>> Burr Settles: In this interface you can't go back and revisit them either.
>>: So does that mean if it’s that fast then users are probably not looking at the example’s
contexts that much, right? Because to click on an example, sort of see what the system thinks,
and then either change their mind or label it would probably take longer than two seconds,
right? So we are probably maybe just using their own-
>> Burr Settles: [indiscernible] context, that's true.
>>: This is good. The features here are recognition tasks. They're not generation tasks. The
users don't need to think about words.
>>: Right. But you might want to see the context to know what your label-
>>: That's true.
>>: [inaudible].
>>: It's interesting because the clustering you have actually gives a type of context in a way so it
may be that that's actually why you can go fast because of that clustering.
>> Burr Settles: Yeah. You don't necessarily have to look at the empirical context.
>>: So what is the first batch you see when you have zero labels? Because there may not be a
word in there that you have ever seen at this point.
>> Burr Settles: I don't exactly remember what the first batch is, it’s either the 100 most
frequently occurring features or we ran this LDA, the Latent Dirichlet Allocation and then
picked-
>>: So why would you get anything that is multi-
>> Burr Settles: You're hoping that by picking the most frequent or representative features that
hopefully they have some predictive value.
>>: With zero labels?
>> Burr Settles: Even with zero labels.
>>: For domains the high-frequency words are probably-
>> Burr Settles: Like proceedings shows up all the time.
>>: You're only seeing citations. [inaudible] if it’s not generic.
>> Burr Settles: Right. So everything that you're pointing out are definitely limitations of this
particular implementation of the idea in general. Did you have another question? Then these
are the learning curves for third user on the Craigslist classified data set. So here even the
active feature labeling is winning all along.
So here's the sales pitch sort of slide: this is a new framework that we devised that
outperforms passive learning with feature labels and active and passive learning with instances,
but there's a dirty little secret, and you guys have been hinting at it all along. There's a few
different dirty little secrets. So one is that the experiments are not exactly interactive. What
happened was the users worked on a particular batch for two minutes and then they waited for
the CRF to retrain, which took about half an hour or an hour, and they went ahead and did
something else, and then the interface would ping them when it was ready and they would
work for another two minutes and so on. So those learning curves, the reason they were kind
of jagged lines that went exactly from two-minute intervals is because they weren't exactly
contiguous. So we were lying and cheating a little bit.
>>: It's kind of shocking that labeling instances would perform so poorly compared to labeling features.
I wonder if it's an artifact [inaudible] CRFs being very hard to train.
>> Burr Settles: I don't think so because those results are consistent with the next section
which was using a completely different model on a completely different kind of task.
>>: Even for just document classification, do you think that the [inaudible] labeling features
would still win?
>> Burr Settles: Yes. I have evidence that it still wins. But some other limitations of this work are
that the human can't volunteer the feature labels, so you're restricted to just labeling the
features that are queried. So ideally we would like, even tabula rasa before anything starts, to
be able to think: the word proceedings didn't show up on my grid, but I want to tell the
algorithm that the word proceedings implies book titles. And we can only label features. It
would be nice if we could mix the annotations of both features and instances. So because of
this we are not quite A plus. We are at a B, a better-than-average kind of paradigm, but we're not
all the way there.
So now I'm going to switch gears and talk about this system called DUALIST that I built for
interactive text classification which is an attempt to address basically all of those previous
limitations.
>>: Before you do this imagine a way to build a classifier for basketball.
>> Burr Settles: And it classifies documents, whether this document is about basketball?
>>: So at first the word football is positive compared to the rest for basketball because football
and basketball may appear together or something like that. But after I have the word
basketball that football becomes negative. So what happened is the feature can be positive
first, then I put another feature. The difference between the two sets now becomes negative.
Then I put another feature that explains this negative [inaudible] because positive again and
what happened is that since I'm combining the features in a very nontrivial way the human
intuition gets completely off as to whether it's positive or negative given the other features. Now
for every single feature we can't predict, we really cannot predict, which direction it's
going to push in the context of other features. Is that a problem?
>> Burr Settles: It could be a problem. This gets back kind of, I think it was Jerry that asked a
question about being able to revisit feature labels and changing your mind later, which-
>>: But this is not something [inaudible] possibly understand. But if he selects the feature
football you can say it's negative, and the constraint is just that, in the margin on your unlabeled data
set, your model predicts that more of these are not basketball; it's not saying anything
about the weight of the thing in your model.
>>: You really want to go back to the conditional though. The conditional given just football is
positive. What you're trying to differentiate is the conditional given football and basketball and
that's different, which you can potentially ask about with in-context things, but that's a different
judgment than the overall [inaudible], right?
>>: Right. But the information that the user provides, how much information am I actually
getting? At some point it becomes, so that's one approach. Then I look at another approach
[inaudible] and I look at the weight of the different things, and there were things that I would
have guessed were positive because by themselves they would be positive, and then I look at
the weight in the bag of words and it's negative, and clearly the two things are carrying
different intuitions. One is in the context of everything else, but humans can only reason in
very, very limited context.
>> Burr Settles: So you can imagine the-
>>: [inaudible] providing the wrong information just by working in context.
>> Burr Settles: So you can think about the human being able to understand that if this word
appears and this other word does not appear, or if they both appear, that's a different if clause
of the if-then rule. You can imagine the active learning submodule of this
system being able to notice these discrepancies empirically, when across the cross product of this
feature being present or not, so there are these four different conditions, the label distributions
are different, and maybe being able to synthesize a conjunction or a disjunction kind of rule and
then asking you about that rule. This is something I'm just thinking about off the top of my
head now that would help. It would increase the size of the feature space and make it more
complex and the search for generating good candidates for that might be really difficult, but it's
possible. It’s something that you can do to help with the situation that you're describing.
>>: My experience is that it works for the first ones and then-
>>: So I have a different interpretation of this, which is that your intuition of the sign changing is about the
sign of the [inaudible] weights in the model. But if you look at the KL term that's actually in the
model they are not directly constraining the sign of any coefficient. It's a very different kind of
regularization. It turns out that will take care of the issue you mentioned.
>> Burr Settles: In the case of a logistic regression with that kind of term that's true.
>>: You could have a positive feature to a negative weight in your model.
>> Burr Settles: Because all of these other features are overwhelming the prediction.
>>: What it means is that at some point [inaudible]. You should suggest a feature [inaudible]
weight what information-
>> Burr Settles: Now it could be that, just modulo everything else, it's a positive predictor but
you have so much other information either through the labeled instances or through all of the
other labeled features that it kind of overwhelms that model. These are the real reasons that it is
positive, and this other positive label really doesn't contribute to the terms.
>>: So you're saying that the user suggests a word, the generalization error increases, but
decreases the moment it gets better even though the weight is still negative?
>> Burr Settles: Right. It could be that if the weight was zero it's still positively correlated but
modulo everything else, but the objective term would change hardly at all if you just took the
feature out completely. But again, maybe we can take this discussion off-line, but I think Jerry
was going in the right direction where at least for the case of the discriminative model and this
way of incorporating the domain knowledge into the regularization term it kind of takes care of
itself. It might be even more of a problem for this system which is called DUALIST.
So this is a text classification system. I don't know what's going on. Well, that's unfortunate.
Oh, there it is. Sorry. So this is a quick demo of the system. Where did I put this? There it is.
Data. I should have been more prepared.
So this is a sentiment analysis example. So what the interface is doing, let's see if this messes
things up to make it bigger. So what we have here on the left are actual documents, which
currently, because nothing’s been labeled so far so the documents on the left are randomly
selected, and these are movie reviews from the standard sort of sentiment analysis task, and
we have two labels here: positive and negative and right now these are features. So the bag of
word feature representation which right now are just ranked by their frequency in the corpus.
But before we even do anything I can think of the fact that positive words like great and terrific
are predictive of positive movie reviews and maybe bad and terrible are predictive of negative
movie reviews, and I can just click this button and it's already incorporated that into the model,
and now it's asking me, using uncertainty sampling, about some more documents over here on
the left, and ranked words that it wants to ask about, whether these are words that are predictive of
positive and negative reviews. So if you scroll down, let's see, uplifting might be associated with
a positive review, Oscar, good Oscar contenders, perfectly, provocative, compelling, wonderful,
superb, powerful, magnificent; on the negative side we have waste, like this was a waste of
time, stupid, laughable, dumb, pointless, ridiculous. So far we haven't even labeled any
documents. Worst, sequel, and Schwarzenegger is maybe associated with-
>>: Why are there fewer candidate negative words?
>> Burr Settles: So this is actually kind of an artifact of the way I'm picking things, which could be
done differently. In a couple of slides I'll go into the algorithm.
>>: [inaudible] select a positive box-
>> Burr Settles: The selected positives are at the bottom. So you can actually go back and
revisit and change your mind about labeled features in this particular case. Jean-Claude Van
Damme is maybe a negative predictor, awful, subtle, anyway you get the idea. So far we
haven't even labeled any instances. And we can look at, let's pause and look at the
predictions. So this is again a research grade interface but over here this is the predicted label
on the far left, this is the confidence of the model of that label, and then this is the actual label.
So you're starting to see there's a few mistakes like we are predicting this document, this
review of Species 2 we think is negative when in fact it was positive, but in general you'll see
there's a lot of parity between the predictions and the actual labels already even though we
haven't labeled any instances and we haven't labeled that many features. And then we can go
through-
>>: [inaudible]?
>> Burr Settles: No we don't, because there are sort of two modes. There's explorer mode, which
is what I'm using here; you don't need labels to be able to use the explorer mode.
>>: What I mean is you have all those yellow things highlighted. That's for our human benefit
but there are no numbers on there.
>> Burr Settles: Right. There's also an experiment mode in DUALIST which requires labels, and
then we could get an actual accuracy estimation.
>>: [inaudible].
>> Burr Settles: Those are the true labels.
>>: So this is above 90 percent accurate?
>> Burr Settles: Maybe. Actually, if you want to bear with me a minute I can very happily, using
the command line, kind of figure out what the confusion matrix is.
>>: I think it's about 90 percent.
>> Burr Settles: It's probably, I'm not sure it's that high but-
>>: [inaudible].
>> Burr Settles: That's an artifact of the underlying model, which is naive Bayes. So let's jump
back into the talk and I'll talk about how this actually works. So it's a multinomial naive Bayes
because it's been known to work well for a lot of text problems, it's basically linear training
time; I'll skip over the actual modeling. So here's a cartoon illustration of how the learning
algorithm works. So we have our vocabulary of words, we've got two copies of each of
these features, one conditioned on it being negative, one conditioned on it being positive, and
then we start with your typical symmetric Dirichlet prior, or Laplace pseudo-counts of
one, just for smoothing.
So if we have a labeled feature, so I can inject some knowledge by just typing in a word under
this particular label, what we do is we just pretend like we hallucinated five examples where
the word bad occurred in a negatively labeled document and fantastic in a positively labeled
document. So we update these counts with the labeled features like so, and then if we have
labeled documents we can do a sort of variant of maximum likelihood estimation. So we have
all these negative documents; we can update the counts for the words that appear in those
documents on top of the priors that came from the labeled features, the same thing for the
positive documents-
>>: What's the total number of documents?
>> Burr Settles: It depends on the data set; for what we were just doing I think it's 2000
documents. It's not that huge, but this is still linear training time. I've used it on-
>>: That means five comes from [inaudible]?
>> Burr Settles: You mean the pseudo-count? So it turns out it's not that sensitive. So I ran
some experiments for different values of this, and it turns out, invariant of how big the actual
data set is, it doesn't matter that much what this pseudo-count is as long as it is less than 100. For
some reason I don't fully understand theoretically why, but if it's less than 100 it seems to work
pretty well. And then from here we can actually do one step, not to full convergence but just
for the sake of speed, one or two or three or a finite number of steps of the EM algorithm, where
we probabilistically label all the unlabeled documents and then fractionally update the
pseudo-counts depending on the probabilistic labelings of the unlabeled documents and
then re-estimate the parameters based on that, and that is our final classifier; and then using
that classifier we can do the active learning steps by scanning through all the unlabeled
documents, selecting the most uncertain ones, and then also scanning through all the features and
selecting the ones with the highest expected information gain, so the ones that have the largest
difference between the overall expected uncertainty and the expected uncertainty of the documents
that have that feature present.
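A rough reconstruction of that training loop, assuming a binary task and a simple bag-of-words setup (this is my own sketch of the idea as described, not the actual DUALIST code):

```python
import math
from collections import Counter

LABELS = ["positive", "negative"]
ALPHA = 5.0    # pseudo-count hallucinated for each labeled feature
SMOOTH = 1.0   # Laplace smoothing on every word/label pair

def posterior(words, counts, prior):
    """p(label | document) under a multinomial naive Bayes model."""
    logp = {}
    for y in LABELS:
        total = sum(counts[y].values())
        logp[y] = math.log(prior[y]) + sum(
            math.log(counts[y][w] / total) for w in words if w in counts[y])
    top = max(logp.values())
    unnorm = {y: math.exp(v - top) for y, v in logp.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

def train(vocab, labeled_features, labeled_docs, unlabeled_docs, em_steps=1):
    """labeled_features: {word: label}; labeled_docs: [(words, label)]; unlabeled_docs: [words]."""
    base_counts = {y: Counter({w: SMOOTH for w in vocab}) for y in LABELS}
    base_prior = Counter({y: SMOOTH for y in LABELS})
    for w, y in labeled_features.items():      # inject domain knowledge as pseudo-counts
        base_counts[y][w] += ALPHA
    for words, y in labeled_docs:              # ordinary counts from labeled documents
        base_prior[y] += 1
        base_counts[y].update(w for w in words if w in vocab)
    counts, prior = base_counts, base_prior
    for _ in range(em_steps):                  # a small, finite number of EM steps
        new_counts = {y: Counter(c) for y, c in base_counts.items()}
        new_prior = Counter(base_prior)
        for words in unlabeled_docs:
            post = posterior(words, counts, prior)
            for y, p in post.items():          # fractional count updates
                new_prior[y] += p
                for w in words:
                    if w in vocab:
                        new_counts[y][w] += p
        counts, prior = new_counts, new_prior
    return counts, prior

# Active learning would then rank unlabeled documents by the entropy of posterior(...)
# and rank candidate features by expected information gain.
```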
And then what I'm doing is ranking the top 100 or 200 of those and then organizing them into
the columns by the model's expectation that this one occurs more frequently with this versus
that label and so the reason there was this imbalance after a couple of iterations was just
because the most informative features tended to be more correlated with the positive label but
you could also do a ranking, choose a slightly different algorithm that would keep those more
balanced. So I'll discuss some more user experiments.
>>: [inaudible]?
>> Burr Settles: Just to keep things fast and tractable and not necessarily run to convergence.
You could also do two, three, four. But the important thing is that it's a finite number of steps.
Another thing is, I think, empirically I played around with this, and because the model is so
strongly biased when you're labeling features, it can really run off the rails if you do more than a
couple of steps. So there's two reasons: one for speed and the other just to keep it from
believing its own press too much.
>>: [inaudible] train the logistic regression on your expected counts to see if the probabilities
[inaudible]?
>> Burr Settles: I'm not sure what you mean.
>>: So you do an [inaudible] step. So for every document you have a fractional label [inaudible]
so you could use that as your external data for your logistic regression.
>> Burr Settles: Oh, I see what you're saying. So kind of using the naive Bayes as a proxy. So I
suspect that probably won't work that well because the posterior distributions from the naive
Bayes classifier are still overly polarized. So Patrice was pointing out that those confidences
looked a little scary.
>>: So that's the problem I was trying to address: if I have so many positive words, then by
retraining with logistic regression you could avoid the double counting that [inaudible] makes
your probability crazy.
>> Burr Settles: Right, right, right. I think what might work better is actually trying some
other approach: instead of naive Bayes, just actually using logistic regression and then
incorporating the knowledge either using generalized expectation, like a regularization approach,
or something that is a little more computationally efficient where you can do incremental updates.
That would be better. But this work was maybe three years ago and I haven't really been working
on it since then. That's a direction I wanted to go at the time for sure.
>>: Very quick comment on this. Your [inaudible] prior with [inaudible] of number five, is that
equivalent to having a pseudo-document with [inaudible] wording [inaudible]?
>> Burr Settles: No. It's actually, actually I guess it is. Or five documents that have that word
once. That's true; because it's a multinomial it's the same thing as, yeah.
>>: Is that the same thing? Five positives with four numbers-
>>: You create five pseudo-documents. Each one contains only one word, and so forth, five
times.
>> Burr Settles: That should be the same as one document that has nothing but the same word
five times in the case of the multinomial event model. If it's a binomial event
model then it's different.
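To spell out that equivalence (my own gloss on the exchange): under the multinomial event model the only statistic that matters for a word $w$ and label $y$ is the total count $N_{w,y} = \sum_{d:\,\mathrm{label}(d)=y} \mathrm{count}(w,d)$, so five pseudo-documents that each contain $w$ once add the same $+5$ to $N_{w,y}$ as one pseudo-document containing $w$ five times; under a binomial, presence/absence event model the two cases would differ.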
>>: So I think the most [inaudible] in very short [inaudible] you can get absolute [inaudible]. So
this value [inaudible], the thing I'm wondering about is: does this get you a reasonable accuracy
super fast, but if you were to compare it with just building a model with labels in the traditional
way, how far will [inaudible]?
>> Burr Settles: Hold onto your seats. So this is a more exhaustive set of experiments than the
previous paper. Here we used three different data sets with five different annotators labeling
those data sets in three different experimental conditions, and the order of the conditions was
randomized to reduce any kind of presentation bias or learning about the problem. So we used
DUALIST in the flavor that I just showed you, versus active learning on documents only, so no
feature interactions at all, versus passive labeling of documents only, where we randomly
selected the documents and documents are all that you can label. And we ran these in six-minute
trials. To keep things consistent across all conditions there was a fixed 90 percent unlabeled pool,
because users could actually label things differently than the gold standard, so we pulled out the
unlabeled pool at 90 percent and then a fixed 10 percent test set.
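A minimal sketch of that protocol, with assumed helper names: a fixed 90 percent unlabeled pool and 10 percent test set, plus a shuffled condition order per annotator to reduce presentation and learning effects.

    import random

    CONDITIONS = ["dualist", "active_docs_only", "passive_docs_only"]

    def make_trial_plan(doc_ids, annotators, seed=0):
        rng = random.Random(seed)
        docs = list(doc_ids)
        rng.shuffle(docs)
        cut = int(0.9 * len(docs))
        pool, test = docs[:cut], docs[cut:]      # fixed 90% pool / 10% test split

        orders = {}
        for annotator in annotators:
            order = CONDITIONS[:]
            rng.shuffle(order)                   # each annotator gets a randomized condition order
            orders[annotator] = order
        return pool, test, orders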
So these are the learning curves for four distinct users on the same problem. This is the
classification of university webpages into faculty, student, research project, or course, so there
are four different labels. The red line is DUALIST; the green line is active learning with
documents only. The first thing to notice is that both of the active learning paradigms have
better learning curves than passive, and the cost function here, the X axis, is actual wall-clock
time with no pauses in between the retrainings, an actual contiguous six minutes of work.
The dotted line at the top is the accuracy of the classifier if we trained on the full 90 percent
labeled training set. So in all of these cases we're getting within 90 percent of the upper bound
accuracy within six minutes with DUALIST.
>>: So what's the X axis on the top?
>> Burr Settles: It’s in seconds. So this is 360 seconds.
>>: So what's the [inaudible] bottom and the top?
>>: Four different users.
>> Burr Settles: These are learning curves for four distinct users on the same problem.
>>: What was their interface for the active labeling and passive labeling?
>> Burr Settles: So it was the same as DUALIST it just didn't have the columns on the right for
labeling the features.
>>: So you would hire the top right user?
>> Burr Settles: This person seemed to get it right and just really nailed it out of the park.
>>: So at the end of six minutes how many documents slash features did they manage to label,
right? And you said that dashed line is the upper bound trained on the 90 percent split, which
would be like 1,000-something documents.
>> Burr Settles: So in this particular case I think there were 1,000 documents... there are 4,000
total documents.
>>: 3000 documents.
>> Burr Settles: So this is trained on like 3,000-ish labeled documents.
>>: But for six minutes how many documents can people label?
>> Burr Settles: That's a good question. I think I might discuss that in the paper but I don't have
it on a slide.
>>: [inaudible] less than 100.
>> Burr Settles: It's not that many. So these are the results for one data set. We get similar
results for this science data set, which is one of the 20 Newsgroups subcategories with four
different labels. I don't remember exactly what they are, but again, we can get to within about
90 percent of the upper bound accuracy, and the gains are even more stark.
>>: Is the upper bound still [inaudible]?
>> Burr Settles: It's the same model. It’s just trained in a traditional supervised way.
>>: So you said you're not testing against some ground truth, you're testing their own labels?
>> Burr Settles: No. The test set is consistent for every line you see, and it's the ground-truth
gold standard label that came with the distribution of this benchmark data set. But it is
possible that a user in one of these conditions gave a label that disagreed with the ground
truth.
>>: What percentage of data is actually about science?
>> Burr Settles: So it's all about science; there are just different fields of science. One is like
astronomy and one is... I don't remember what the four categories are.
>>: So you go with the most prevalent categories of science? [inaudible] the most common?
>> Burr Settles: I think they are somewhat balanced in this particular data set.
>>: [inaudible] labels and labeling documents and features?
>> Burr Settles: They can inject features, they can label...
>>: [inaudible] how frequently they use the two?
>> Burr Settles: Right. So in general, I think I have a slide in here somewhere that visually
shows it; people tended to start out... actually, let me see if I can just find that slide. Here it is.
So at the beginning it's fairly distributed; this is the amount of time, or the number of actions
that they did in, say, the first minute. It's pretty evenly distributed between labeling
documents, labeling features, and volunteering features, which is just typing in brand-new
features. And over time they do less volunteering and less feature labeling and spend more
time labeling documents.
>>: [inaudible]?
>> Burr Settles: Probably because the low hanging fruit has been picked.
>>: [inaudible] features may not work as well after a while?
>> Burr Settles: After a while the features stopped being that interpretable maybe.
>>: It would be interesting to see if you just have people label documents and then show them
features and see if they get added value at some point. Is it only valuable in the beginning or is
it also valuable [inaudible]?
>> Burr Settles: Yeah. So somebody else asked a question about this in the previous section,
and that's a good question. I'm not really sure. I think there would still be some utility, but it
would be less. I mean, there are diminishing returns in general with all of these things.
>>: Does the user have any sense of progress? Can I tell whether the feature I added made
things better or worse? Do I have any idea?
>> Burr Settles: With this interface, no. Some of the things that you've built into the ICE
system are really good, like the visualization of the spectrum of predictions and whether or not
they're actually positive and negative, so you can visually see where the false positives and
false negatives are and then go in and try to understand why. None of that is part of this
current interface.
>>: Why would anybody volunteer a feature? What's my motivation for doing that? Why
would I do it in the first place?
>>: They don’t know the impact you mean?
>> Burr Settles: It's a type of domain knowledge that you have; you come to the table kind of
knowing this about the problem.
>>: [inaudible]. I mean, I can give a label, or I can give a feature, or I can type something in, and
I'm not really getting any direct feedback of "that was great" or "that wasn't so great," so I have
to make a decision on a policy without really...
>>: I think the question is how do you make the teacher better?
>> Burr Settles: How do you inform the teacher more?
>>: So another way to interpret the curve is that the best is 98 percent. You get to 90 percent
pretty quickly, which is five times the error rate [inaudible] is not that great. You can say oh, it's
90 percent accurate...
>> Burr Settles: It's five times the error rate in six minutes as opposed to two weeks.
>>: I mean there are many problems for which [inaudible] 98 percent one is usable and the
other isn't.
>>: I think another way to ask the question about this is you did this for six minutes. If you add
another 10 minutes what would you do? Because you're only running this for six minutes.
What would you do to get it better in the next 10 minutes?
>> Burr Settles: Another artifact here might be that you're maybe a little bit limited by the
classifier. So the next slide I was going to show was...
>>: So is the 98 percent the human performance versus [inaudible]?
>> Burr Settles: The 98 percent, this upper bound, is using the gold standard labels for
everything. So the results here were less amazing. DUALIST still seems to be consistently
better, and this user apparently was doing something right. But this is a trickier problem,
because sentiment analysis is tougher than some of these other problems. The other data sets
were in some sense a little more clustery, so that generative naive Bayes model captured them
better, whereas you've got much more subtle and non-conditionally-independent features
going on with this particular data set.
So it could be that swapping out the paradigm for a logistic regression classifier would actually
do much better here. And, like what you were getting at, facilitating some information back to
the human about what was actually learned and what actually went into making different
decisions would be helpful in deciding, oh, maybe I'm going to undo this feature label, this kind
of domain knowledge. So it's almost 5 o'clock, so I could either stop now or I could briefly power
through a couple of other salient points. How do people feel?
>>: Power through.
>> Burr Settles: Power through. Okay. So here's an interesting failure case of basically a user
who didn't label any features until right there, and then the accuracy started to jet up. So
basically on the movie reviews data set there was almost no difference in behavior between the
three conditions, so we didn't really see any gains.
Here was another one where DUALIST started out looking good and then sort of flatlined and
went crazy. This was for the WebKB corpus, and for this particular user, basically after two
minutes or so of labeling, the features that were queried just kind of stopped making sense,
and the only column that made sense was the course label. So they just spent all of their time
in feature-labeling land labeling those features and not any of the others, and there wasn't a
good mechanism in the training algorithm at the time to account for that imbalance of the
feature labels. So it started running off the rails, and the active learning with documents
turned out to be better.
This is an open source sort of thing you can download and play around with. So these are some
of the strengths, but there's this question of why it's not always better, and you guys have been
getting at it. So we're at an A minus; we still haven't gotten all the way to an A plus. So can we
get to an A plus? I'll just power through this. So Jerry and I, three years or so ago, tried to
replicate... we used the data for a few different things, but one of them was to try to replicate
these results using regular people, or Turkers, instead of... I have a confession: all the users in
the previous studies were graduate students in natural language processing and machine
learning. So we kind of understood the problem a little bit more than people in general would.
So the idea is: can we replicate the user results? And maybe not. We had 32 or 33 different
Turkers in these different conditions try to train things, and basically the learning curves
weren't statistically significantly different. However, if you look at the final accuracies, the
distribution of final accuracies of the models after, I think, these were 10-minute experiments,
clearly some of the users who were given the DUALIST interface were doing better than the
others. So the question is: what were they doing that helped?
So let's review the different things you can do. You can label documents, you can label feature
queries, and you can volunteer features. Those are the three different first-class activities, and
it turns out that volunteering features, that behavior, seems to be really predictive of the final
accuracy, at least in this really short 10-minute restriction. And some evidence for this: we went
through and, out of the 30-or-something individuals in that group, we found these three
behavioral subgroups. So 11 of the 33 didn't volunteer any features at all; they only labeled
documents, and they may have labeled some of the queried features, but they didn't volunteer
any. Then there was a small group that volunteered a lot of features, and another third or so of
them volunteered a few features. I forget what the cutoff was.
>>: [inaudible] If I type a word that's already in the list, did you count it as a volunteered
feature or a labeled feature, even though it's already in the list?
>> Burr Settles: It's already on the list... Basically, if it's in the list but not labeled in the list,
then that counts as volunteering it. If it's already in the list and labeled, then it's sort of a
duplicate and we throw it away. Does that make sense?
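Stated as a rule (hypothetical helper, not DUALIST's code): a typed-in word counts as volunteered unless it was already both queried and labeled, in which case it is a duplicate and is dropped.

    def classify_typed_feature(word, queried, labeled):
        # queried: words shown in the query columns; labeled: words the user has already labeled
        if word in queried and word in labeled:
            return "duplicate"      # already provided via the list; thrown away
        return "volunteered"        # brand new, or queried but never labeled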
>>: Maybe when you go home today you can also [inaudible] faster because you don't want to
[inaudible].
>> Burr Settles: You don't have to scan through the list.
>>: There's also the number of features created, right?
>> Burr Settles: Right.
>>: Let me just clarify. So there was this word like "excellent" in the list, and I didn't read it
really carefully, but I thought about "excellent."
>> Burr Settles: And you typed in excellent?
>>: Yes.
>> Burr Settles: That would count as volunteering a feature even if it was already asking about
that feature. That's not the mechanism through which you provided the information. And so
then if we look...
>>: [inaudible]. These are the only modes that can be in active learning. That was the point I
was trying to make, but you don't have enough time.
>> Burr Settles: Anyway, so these are the learning curves for the people in those conditions,
and here we see some statistically significant gains. And this is a linear regression to predict
final classifier accuracy as a function of the number of times people did these different actions,
and the only one to have a statistically significant effect is whether or not they volunteered
words.
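For concreteness, a sketch of that kind of analysis with assumed variable names: ordinary least squares of each user's final accuracy on counts of the three action types, with t statistics for a rough significance check. This is not the analysis code from the paper.

    import numpy as np

    def fit_ols(actions, accuracy):
        # actions[i]  = [n_docs_labeled, n_query_feats_labeled, n_feats_volunteered]
        # accuracy[i] = final classifier accuracy for user i
        X = np.column_stack([np.ones(len(actions)), actions])   # add intercept column
        beta, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
        resid = accuracy - X @ beta
        dof = X.shape[0] - X.shape[1]
        se = np.sqrt(np.diag((resid @ resid / dof) * np.linalg.inv(X.T @ X)))
        return beta, beta / se                                   # coefficients and t statistics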
>>: Can you show [inaudible]?
>> Burr Settles: On this?
>>: Yeah.
>>: I think increase is a good one. It could be explained by something other than the group of
things you had listed there. It could be the number of features that they labeled, and by typing
they could label more.
>> Burr Settles: So that's a possibility. It may also be...
>>: [inaudible] list there for your task.
>> Burr Settles: I kind of suspected that didn't happen very often, but it may have. Another
explanation is that there's a hidden variable here: maybe they're just good labelers, and the
good labelers happen to also want to take advantage of this behavior. So it's correlated but not
causal.
>>: I doubt it's the difference in the number of labels, because everything is on a time axis there,
and it would be surprising if Turkers were all that fast as typists but really bad list scanners.
That would be weird. It might be faster to type than to scan the list, but it wouldn't be
consistent with any other scanning-type study.
>>: They probably scan the list and select all the ones that make sense there and this is
measuring that stuff that they're adding to it, right?
>> Burr Settles: Well, all of the actions are included in this cost function.
>>: I think there is a difference though, because at some point you're not finding new words
and you have to scroll and scroll before you find all of them, as opposed to when you type, you
type one word instead of reading 100.
>>: There's a real informational difference but I don't want to take his time.
>> Burr Settles: Those were actually the main points I wanted to make. Maybe we can still get
to an A plus, but there's a lot of things to learn, and I think it's inherently interdisciplinary: it
takes research in machine learning to figure out algorithms that can pose questions in these
different mixed-initiative modalities and make use of the human's domain knowledge; that's
the next bullet point.
Then in human-computer interaction there's both interface design, which I haven't done as
much of and it sounds like you guys have been doing a lot more of: helping the human to
actually understand what's going on in the model, understanding what the machine has
learned. Most of the work I've done is facilitating the communication from the human into the
algorithm in these different ways, but not so much understanding what the algorithm is
learning and being able to go back and debug the resulting artifacts. And then there's also this
human behavior aspect: what kinds of things do people tend to do, how do those impact
efficiency, and are there ways to design the interfaces to encourage those activities? Because if
you end up with these richer multimodal, mixed-initiative interfaces, now instead of one thing
the human can do 10 different things, and so how do we optimize getting the human to do the
right things at the right times? So that's kind of all I have at this point. Only eight minutes over.