>> Xiaojin Zhu: It's my great pleasure to introduce Burr Settles. Burr graduated with a PhD from the Wisconsin Computer Science department; then he did a postdoc with Tom Mitchell at Carnegie Mellon; now he is probably best known for his survey book on active learning, and he is also very famous for writing songs and performing at actual events. Okay. So today he's going to tell us about interactive machine learning. >> Burr Settles: Thanks, Jerry. It's a pleasure to be here. I think I'm probably more famous for active learning than for my band, although sometimes I wish it were the other way around. So it's a pleasure to be here. It's been fun to talk to a lot of the faces that are here today and learn about what you guys have been doing. So a lot of this talk is definitely preaching to the choir and it sounds like a lot of things that you already know, but bear with me. Hopefully it's not still boring to you now, but I'll start out by talking about the typical passive machine learning paradigm. We've got these three elements. There's a human expert who knows something about the problem and a machine learning system that you want to train to automatically solve the problem, and then you've got the data, which typically starts out as unlabeled data, and there's a bit of a bottleneck problem in that there's a lot of the data. If there weren't a lot of it we probably wouldn't want to be training machine learning systems to solve the problems for us. So what got me interested in this is when I was a graduate student, let me switch over here, one of my first projects was this information extraction system for biomedical texts. So this is a little GUI demo which amazingly still works after a decade of not touching the code. So the task here is, given a biomedical journal article, it will automatically extract things like genes and the proteins that are coded for by those genes and cell lines and cell types and things, so these are very related to the problems that you guys are working on: extracting names and addresses and phone numbers, or people and organizations and locations out of newswire text. However, this was a less popular and less studied problem than those other things, and so there was a paucity of data, so the labeling bottleneck was very real, and that got me interested in how to kind of optimize that and make better use of the human's time. So if we were to assign a grade to this paradigm, as if it were a machine learning student or something, then you'd probably give this a C plus, but since we're computer scientists maybe a C plus plus. I spoiled my own joke. So the question is, can we do better? And I take inspiration from this quote: Computers are useless. They can only give you answers. Maybe we can make them more useful if they'll ask questions as well. I don't know if Pablo Picasso actually said this, but the Internet says so; I can't find any definitive evidence that he said it. So this informed the active learning paradigm where now we've got this cycle: rather than the human just selecting some amount of the unlabeled data to label and handing it off to a machine learning algorithm, now we can label some small subset, give it to a machine learning algorithm, then that algorithm can inspect the swaths of unlabeled data and hopefully, in a targeted way, make better use of the human's time. So here was an interesting, maybe negative, result from my thesis work where, I don't know if I'm not supposed to stand in front of the light, so ignore this chart for now.
This is a typical learning curve where you have the number of instances in an information extraction problem, and what we would like to see is what you're seeing here: that with less labeled data we can get higher accuracies, or in this case F1 measure, in the active case versus the passive case. So this was for a real user experiment that I did on this biomedical named entity recognition task. So this is what we hoped to see and it's pretty typical. However, since it was a real experiment, it was actually logging the amount of time that the human spent on the tasks, so it was interesting to compare actual wall clock time. So this is the exact same learning curve but where the cost function is different: it's actually time and not necessarily the number of instances. It turns out the instances that were hard for the machine to label, and so it wanted more information about, turned out to also be hard for the humans to label, and so the actual gains kind of washed out because we weren't taking the right cost function into consideration. So this inspired sort of a variant of active learning that I was calling interactive learning. Interactive learning is maybe too broad and it can be interpreted a lot of different ways, so machine teaching or machine coaching or something like that is maybe more along the right lines. But the idea is to treat this process more like a conversation between the human experts and the learning algorithms. So we can let the human show some initiative by offering some advice about the problem. We go into the problem presumably with some domain knowledge, maybe it's not perfect but we have an idea of what characterizes the problem, and let the machine show some initiative in a few different ways. One is asking questions, or active learning, and another is attempting to teach itself from all of the unlabeled data that it has available, and then maybe that can inform the questions that it asks and the types of domain knowledge that it tries to elicit from us. So this is a very mixed-initiative paradigm in that both the humans and the machines have some agency and both have multiple ways of interacting with each other. So I'll make this a little bit more concrete throughout the rest of the talk. But what this might look like in this interactive machine learning or machine coaching sort of scenario is, before any labeling has been done, just tabula rasa, the human expert has an idea. So this is an information extraction task of extracting BibTeX records out of actual citations in a document. So you think the word proceedings, that's a good predictor of the book title field. So this is a rule that I can just come up with. Another one might take a slightly different form, I don't know why it's not showing up on this screen, there we go. Another one is I can maybe define a regular expression of four digits in a row. That seems to be a predictor of dates. It could also be things like page numbers. So the advice doesn't even need to be perfect but it needs to be kind of pointed in the right direction. So in the language of advice we could give, here's a repeating sequence of initials, that is probably indicative of the author field, and then I can come up with these ideas and then hand them off to the machine learning algorithm.
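As a purely illustrative sketch, such advice might be written down as simple pattern-to-label hints; the format and the `advice_labels` helper below are hypothetical, not the actual input language of any of the systems discussed in this talk.

```python
import re

# Hypothetical representation of domain-knowledge "advice": each rule pairs a
# feature test with the label it is thought to indicate. The rules need not be
# perfect; they only need to point in the right direction.
ADVICE_RULES = [
    (lambda tok: tok.lower() == "proceedings", "BOOKTITLE"),              # word cue
    (lambda tok: re.fullmatch(r"\d{4}", tok) is not None, "DATE"),        # four digits in a row
    (lambda tok: re.fullmatch(r"([A-Z]\.)+", tok) is not None, "AUTHOR"), # repeating initials
]

def advice_labels(token):
    """Return the labels suggested for a token by the advice rules (possibly none)."""
    return [label for test, label in ADVICE_RULES if test(token)]

print(advice_labels("Proceedings"))  # ['BOOKTITLE']
print(advice_labels("1995"))         # ['DATE']
print(advice_labels("M.J."))         # ['AUTHOR']
```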
Then it can take this, so far no instances have even been labeled per se, and the algorithm can inspect large amounts of data and then, through some semi-supervised learning, try to explain what it sees down here given those rules as best as it can, and then ask questions like: there's this word conference that appears in the same neighborhood as proceedings, I think that might be the book title field, can you label this feature for me as well? Or maybe everything we've seen so far has been conference proceedings. I don't know what a journal is. Maybe you should annotate this journal citation for me. And then the human can turn around, answer those queries given to the machine learning algorithm, and this whole process can repeat. Ideally the human can even come up with more rules or advice or domain knowledge at any time and inject those into the system. >>: You have no labels here? At what point did you inject labels? >> Burr Settles: At this point. When the algorithm asked for me to label this citation. >>: So there's a few [inaudible] the instance of that field is the segment? I still do not understand what the task was. Sorry. >> Burr Settles: So the task is, given a citation, segment that citation into the title field, the author field, the year, the publication venue, so on and so forth. >>: So I guess I'm confused by at which part do you specify the features and which part do you specify the fields. >> Burr Settles: So my thinking and the sorts of ways that I have been approaching these problems is a little bit different than it sounds like how you guys have been approaching things with the ICE system. So you can think of a bag of words of features, and all the features just already exist, so there's no feature selection or feature generation. >>: [inaudible]? >> Burr Settles: Right. But we can label the features themselves, and I'll talk about the machine learning algorithms for doing that in a few different scenarios. So part of the mixed-initiative aspect here is that we can not only label instances, which in this case means segmenting and chopping up the citation into the appropriate fields, but you can also label features and say, when I see this word or when I see this capitalization pattern or something, those are predictive of these labels. >>: [inaudible] segments [inaudible]. >> Burr Settles: Right. So it's kind of a domain knowledge rule that you can provide without even being linked to any specific example. >>: And the constraint there is that this word will only appear in the positive? >> Burr Settles: Not necessarily. So the rules could be imperfect. There could be noise in them, but hopefully the semi-supervised step would help it to sort of tease out the contexts under which this rule really applies versus when it doesn't apply. >>: [inaudible]? >> Burr Settles: So in the case of conditional random fields, to do this extraction you can think of the algorithm that we use for incorporating this domain knowledge as a regularization term. And in a few minutes I'll actually go into the details of how that works. So there are a couple of different research papers that I would say fall into this paradigm, but before diving into those I would argue that this plays to our complementary strengths. So machines are relatively inexpensive in terms of their time, they're high bandwidth, they don't complain about going through hundreds of thousands or millions of examples, whereas at least I would, but as far as knowledge and intuition about the problem, machines are kind of dumb.
They're brute force, but if they already knew how to solve the problem we wouldn't need to teach them. Knowledge and intuition, however, is where we have our strength. So the active learning helps to reduce the expense of the time spent on the task; the high-bandwidth semi-supervised learning plays to the strength of the machine to be able to crunch through lots and lots of unlabeled data; and the advice-giving part of the dialogue helps us to take advantage of our knowledge and intuition about the problems. So if we can build a paradigm that incorporates all of these things, then that paradigm is maybe an A plus as opposed to a C plus plus paradigm. So I'll talk about these three different research projects over the years: active feature labeling, a system that I built a few years ago called DUALIST, and then briefly, if there's time at the end, I'll talk about a collaboration actually with Jerry looking at DUALIST in kind of a crowdsourcing setting. The first two are focused mainly on user interfaces and modifications to machine learning algorithms to facilitate this kind of dialogue; and the third section is kind of taking the camera, pointing it back at the humans, to understand how the human behavior in this kind of system, what differences that makes in the ultimate accuracy of the systems. So this was an EMNLP paper in 2009 with Andrew McCallum and Gregory Druck, who at that time were at the University of Massachusetts Amherst. So this is a sequence extraction problem. Going back to the citation extraction, we can think of the conditional random field as a probabilistic finite state machine. So if we have this citation here, the extraction corresponds to the most probable path through this model where the states correspond to the different labels. So M.J. and Schervish are both self-transitions on the author state, then Theory of Statistics, Springer, 1995, and so on, modulo punctuation. So this is the underlying model that we are working on, but I'm not going to get into too many details of what CRFs look like under the hood. But to motivate this feature labeling idea: features in this case are the input variables used by the machine learner, and a feature label is a rule, just a simple if-then rule, that can express the human knowledge. So, for example, these are examples I gave before: if the word is proceedings, that implies the book title label; if it's four consecutive digits, it implies a date; if it's a repeating sequence of initials, maybe that implies author or editor. And the way we want to incorporate this information into the conditional random field, or more generally, is that any feature implies some probability distribution over labels given that that feature fires for the instance or that token. And so here are some reference probability distributions. There's a lot of different ways you can formalize this, but let's say this rule implies that every time you see the word proceedings it's a book title, or every time you see the digits it's a date, and maybe half the time it's an author versus an editor. It turns out you can make this exact probability distribution more and more accurate, but it doesn't seem to matter all that much. >>: I have a question. So imagine I use a bag of words [inaudible] and then the words that the teacher provides use a different regularizer, because the problem is different. The teacher has more knowledge that this one is important so it needs to be regularized differently.
Now with these two regularizers, the correlation between the book title and proceedings [inaudible] is going to be learned with the new regularizer. Wouldn't that have exactly the same effect? But it's very easy to formalize if you formalize it this way. You have the bag of words [inaudible] regularizer and what's suggested, and you don't need to say oh, proceedings is positive, because I have data, I have labels. So it will infer that proceedings is positive; I only need to specify, yes, it's always there, because I have the data. I can infer it from the data. I just need to regularize it differently because the prior is different. >> Burr Settles: So I guess I'm not sure I understand the question. So in this case you can actually do this learning with absolutely no labeled data in the sense that- >>: That is the difference I guess. But if I have just even a little bit of labels, all these numbers, how correlated this proceedings [inaudible] that will be very quickly inferred, right? >> Burr Settles: I guess it depends on how frequently that label, that feature in that context being labeled, actually shows up in the corpus. But probably when you're sitting down, no work has been done yet. As a human you sit down to think of these rules, you're probably going to think of the highest coverage, most salient, most predictive, most correlated sorts of features, at least at the beginning. >>: [inaudible] if you use bag of words you're going to need a few labels anyway. So you can have a few labels. The words you're going to put all have enough data, so I'm just wondering, if I put a prior I have to think about whether that is 75 percent of the time predictive of this, but now I have to think 75, is it 80, is it 90. If I just regularize it I'll get the learning to do that. >> Burr Settles: Right. And that's kind of what's happening here. You don't have to think too hard about whether proceedings is 90 or 80 percent of the time. This probability distribution doesn't have to be very precise because of how, and actually in the next slide I'll show how we incorporate these rules, so this is the learning problem. This is the objective function. So we want to minimize the KL divergence between all of these reference probability distributions, so this is the feature label distribution that is generated by the fact that you labeled this particular feature. So this could be 100 percent, it could be 50 percent, it could be 80 percent; it doesn't matter as long as it's kind of pointed in the right direction. And then this is the expected model distribution. So given the current parameters of the model, I'm going to scan through all of the unlabeled data, and 90 percent of the time when I see this feature I think it is this label. I want to get these two distributions to be as similar as possible, so we repeat this for all of the rules that are available, subject to some regularization term to prevent over-fitting. It's a little bit complicated, but it turns out you can optimize this using gradient descent with L-BFGS. You might even be able to do this in a stochastic gradient descent setting, but we were doing batch here. >>: So you could have the constraint that four digits means the date even though there were no four digit dates in your [inaudible] and this would force it to- >>: So [inaudible] zero labels. >> Burr Settles: This works with zero labeled instances.
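Written out, the objective just described is roughly the following; this is a reconstruction in the spirit of generalized expectation criteria, and the notation is mine rather than the slide's:

\[
\min_{\theta}\;\sum_{k \in \text{labeled features}} \mathrm{KL}\!\left(\tilde{p}_k(y)\;\big\|\;\hat{p}_\theta(y \mid f_k = 1)\right) \;+\; \frac{\lVert \theta \rVert^2}{2\sigma^2},
\]

where \(\tilde{p}_k\) is the human-provided reference distribution for feature \(k\) (for example, proceedings is book title most of the time), \(\hat{p}_\theta(y \mid f_k = 1)\) is the model's expected label distribution over all the unlabeled tokens where feature \(k\) fires, and the second term is the regularizer that prevents over-fitting; the whole thing can be optimized with gradient methods such as L-BFGS.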
It could also be that the rule is bad in the sense that there are many four-digit tokens that are part of the page field of the ontology or whatever, but it could be that through the other contextual information, either through the rules or through the annotated instances, it's okay if we disagree some of the time with these reference distributions, because we're just trying to minimize the disagreement. >>: So that term, the left one, is actually fixed, so it's fixed and you update the theta and try to approximate that. >> Burr Settles: Right. You can also, if you have labeled instances, then that's another term; your typical objective function for training a CRF is just another term you can add to this. You could add weighting terms so that the labeled instances are more important than the regularizers. >>: [inaudible] do that is basically because actually the right term actually [inaudible] so that's the prior you need to ask [inaudible]. So this is basically kind of regularization. We try to have better weights [inaudible] unlabeled data and at the same time try to make the probability of predicting a label given the same feature- >> Burr Settles: And match the expectations that the human provides. Did you have another question? >>: I think you answered it. Basically this can be good as a prior, and if you have labels it's going to override it if you have enough labeled data, but that is not shown on this slide, the labeled data. >> Burr Settles: But you could, and in this particular research we didn't actually have any labeled data, we were only experimenting with labeled features at this particular point. But for engineering reasons we could sort of do one or the other at a time, but in theory, and actually since then, we've been able to train models that are mixed in terms of what types of labels are around. >>: How does this actually work for the sparse feature values? So the constraint like the date: if it's got four digits I think it's a date; if it doesn't have four digits I have no guess. So in this conditional distribution, when you're conditioning, is the sum only over things where the feature is present or is it- >> Burr Settles: So the sum is over the feature rule, and this term is over the expected distribution of all of the tokens where that feature fires. However you can, let me think of how to, so it's possible, for example, that four digits that are dates tend to show up in certain contexts. This regularization term, which is preventing over-fitting, will start to push the parameter mass to things like what the previous token was and what the next token is, as long as the feature space is rich enough to include those kinds of features. So it's not just memorizing that four digits is a date. This term is causing the parameter mass to sort of bleed into correlated features, so you might be able to extract, even without any other labeled data, other types of dates as long as they appear in the same contexts. Does that make sense? >>: So why was [inaudible] you needed labels otherwise you would not benefit from [inaudible], but that was incorrect because you will have nonzero weight on the bag of [inaudible] when you have zero weight because it's going to diffuse. So this is actually very cool. >>: It seems like you have a separate set of features and they're independent in some way or separate. So to get this to work, does it depend on the set of features that you have, so that it would actually work within the scenario you're talking about?
Like, does it have a rich enough set of features? So if I only had a date and a conference title and I didn't have anything like previous or next words, then it just seems like it would give me the date and the conference title and it wouldn't give me anything else. So can you comment about the features you need to get this to actually work? >> Burr Settles: Yes. That's a good observation. It's not entirely true in this case because in the CRF there's also some sequential information, so some of the information might bleed through the Markov transition probabilities that are implicit in the model. >>: [inaudible]? >> Burr Settles: Not explicitly. You could also even think about generalizing this to say that, when I see this label, I think 90 percent of the time the next token is this label. Those are probably harder for people to come up with, but those are constraints that you could throw into this sort of model. But I think it is true that having some local neighborhood features, just a very large sparse representation, just throwing in the kitchen sink, which is at least what we were doing five years ago, throwing the kitchen sink of features into these models and letting the statistics sort of wash out, so having things like previous token and next token really is what enables you to make inferences about things where the feature doesn't fire. Does that make sense? >>: You've got to have those features, otherwise it won't do what you said about spreading the mass away from conference [inaudible]. >> Burr Settles: That's a good point. So to make this a little bit more concrete, here's the user interface that we built. It's a little clunky but it's research grade. So what we have over here on the left is a collection of features that are being queried by the system. Over on the right-hand side, so these are organized kind of by distributional similarity, so these are different features that we think are related in some way according to the model. So you can select a particular feature, like the word is Kaufmann, and then over on the right-hand side you see a bunch of examples of that feature being used in context, highlighted with the color of the corresponding label that the model thinks it should have at this time in the training process. So in all of these cases it thinks that Kaufmann is part of a publisher field, and then up at the top you see the expected distribution. So every time I see the word Kaufmann, in 95 percent of those cases I think it's a publisher, and then one percent of the time I think it belongs to an author, which kind of makes sense, or maybe an editor field. And going back to the previous slide, so this is what the reference distribution looks like, which is maybe imperfect, this is what the model's current expected distribution looks like, and the regularization term is just trying to get these two pie charts to look as similar as possible. >>: So does the system give you that word list and you as a user choose from that and label it? This is in some ways kind of like active learning, active feature labeling, right? >> Burr Settles: It is in fact active feature labeling, and the next question is how do we choose which features to present, because we are exhaustively generating all possible features. We have an algorithm for generating the features and there's millions of them maybe. How do we pick the one hundred that we are going to show here? >>: So you didn't do a case where a human teacher had typed in the feature and said I want to label that. >> Burr Settles: Right.
So for this work we don't even exercise that case, but in the next section I'll talk about an approach where you can do that. >>: So that KL distance, this is on a token level prediction? >> Burr Settles: Yes. It's a conditional random field, so a label assignment corresponds to a state assignment in a path of the token sequence through the finite state machine. >>: So can your Kaufmann feature inform Morgan being a publisher? >> Burr Settles: If you have the feature. Maybe we don't have it here, but you could have the feature the previous word is Morgan, and the regularization term will force this to not just memorize that Kaufmann means publisher, but a correlated feature with that is that the previous word was Morgan, and the transition probabilities from publisher to publisher are perhaps higher than they are for other things, so it kind of implicitly learns that Morgan might have that label as well. >>: I see. So with that effect in the modeling can you restrict yourself to unigrams and you get away with most [inaudible]? So let's say there's a lot of Kaufmann that's not a publisher but you leave the Morgan in there. >> Burr Settles: I see what you're saying. Currently, like I said, one percent of the time it thinks that Kaufmann is an author, so there's clearly something else that's informing those decisions. I can't speak to exactly what it is, but it could be that the previous word being Morgan is one of the things that it's using as a predictor for the 95 percent of the time it thinks Kaufmann is a publisher. This particular sampling of things doesn't show any predictions other than publisher so I don't really know. Now, given some of the things that I saw with the demo of ICE, one thing that an interface like this could benefit from is: okay, I'm going to select Kaufmann and author, so show me all the cases where we think it's author, and that might really inform me that there are false positives, or some cases where it will show me more like this, or something like that. So that's another whole mode of interaction that isn't accounted for in this particular paradigm. Okay. So there's the question of how do we choose the right question? So ideally we would like some complex but well-justified decision theoretic approach to select features to maximize some expected utility. So if we got this feature labeled, over all ten possible labels, how much would that minimize our uncertainty? But that's too expensive to do. So instead we use a more efficient approximation, which is just the expected uncertainty of the label distribution over all the tokens where this feature fires, weighted by some term for just how frequently it fires. So we want sort of an exploration-exploitation kind of thing. We want to explore things that we're uncertain about but also restrict ourselves to things that are likely to have impact because they occur frequently. So this is a heuristic that we tried. Here are results from a simulation study with two benchmark information extraction data sets. So one was the- >>: What is the question? Is it this part of the feature? >> Burr Settles: The question is: here is a feature, label it for me. >>: So you're trying to solicit the four digit regular expression and you have some corpus of hypotheses, your bag of words, and you want to surface some of the candidates one by one, right? >> Burr Settles: Or in this case we are actually showing a bunch of them. So the form of the query is I want to pick which 100 words to show in this interface over here.
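A minimal sketch of that selection heuristic follows; the helpers `features_of` and `predict_label_dist` stand in for the CRF's feature extractor and per-token marginal label distributions, and are placeholders rather than actual APIs from this work.

```python
import math
from collections import defaultdict

def score_feature_queries(tokens, features_of, predict_label_dist, top_k=100):
    """Rank candidate feature queries by (mean label entropy of the tokens where the
    feature fires) * log(feature frequency), roughly the heuristic described above."""
    entropy_sum = defaultdict(float)
    count = defaultdict(int)
    for tok in tokens:
        dist = predict_label_dist(tok)                              # model's marginal p(label | token)
        h = -sum(p * math.log(p) for p in dist.values() if p > 0)   # token-level uncertainty
        for f in features_of(tok):
            entropy_sum[f] += h
            count[f] += 1
    # expected uncertainty where the feature fires, weighted by (log) frequency
    scores = {f: (entropy_sum[f] / count[f]) * math.log(1 + count[f]) for f in count}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```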
Then when the user clicks on one of those features, I can see what the model currently thinks about that feature and then I can provide some feedback, like directed feedback based on my domain knowledge that this feature implies this label. >>: [inaudible] Kaufmann you might want to say this is a publisher. >> Burr Settles: Yeah. So there are these buttons at the top. So I could click on Kaufmann and then click on publisher. And you can see some of these, like eds implies editor, data implies title, programs, parallel, language, I guess they're all titles. >>: Those are human labeled? >> Burr Settles: Yeah. So the things that are in parentheses that are colored, those are human-provided labels in this particular batch of 100 feature queries. >>: So in theory you could have this \d{4} as another entry [inaudible] and the human can say this would be a year. >> Burr Settles: So there's a couple of different ways from an engineering perspective you could do that. One is, before any annotation is even done, you could give the feature generator factory or sub-process a language to create these regular expressions. Another one is, if we had the facility to inject new feature labels, we could just write the regular expression on the fly and then have it retroactively apply to everything. >>: So we can flip the grouping here? The whole group of five? >> Burr Settles: Yeah. So this is a little less intuitive than I would like, but basically out of the 100 features there we run a quick clustering algorithm, and these are different clusters of things that show up according to the features that co-occur with those features, so the idea being these features tend to show up in the same contexts, so if we present them together it will reduce the cognitive load. They probably have the same label and so I can just kind of go down the list within a particular bucket, and hopefully they all have the same label and I can go through them quickly. >>: [inaudible] label one feature at a time? >> Burr Settles: For this particular interface. >>: Why would they have the same label? Because in some ways they're just like Morgan, Kaufmann, 1992. 1992 doesn't have the same label as Kaufmann. >> Burr Settles: Right. So those two things probably wouldn't show up in the same cluster though. But maybe, I'm trying to think of another publisher name, maybe the word press would show up. >>: [inaudible] Springer [inaudible]. >> Burr Settles: So it's an attempt, without actually knowing, for the machine to organize these things. We didn't do too much research and engineering into the HCI side of this. They were kind of based on intuitions but we didn't try many controlled experiments. So this is the heuristic we are going to use. Here are some experiments. We had two data sets. One was the research citation running example, another was Craigslist apartment classifieds, trying to extract location, contact information, rent, utilities, etc. We compared combinations of active versus passive learning and feature labeling versus fully labeling instances, and we evaluated the learner after 10 minutes using a simulated annotator; as for the details of the simulator, it was something that had already been used in previous research on feature labeling with conditional random fields, so the details are in that paper. Here are the results. So after 10 minutes this is the token-level accuracy of active learning, where the features were selected, versus this sort of passive heuristic.
We used Latent Dirichlet Allocation to sort of cluster the features and we picked the most frequent features out of each cluster, so it's another hybrid. It's passive-ish but active-ish, kind of, but it was a previous approach that had been used in these kinds of interfaces. Then this is active learning by selecting instances using uncertainty sampling, versus passive learning, so a random selection of instances, and fully labeling instances. And the cost functions here, I don't have them on the slide, I don't remember exactly what they were, but empirically in some user studies we found that it took like 2 to 4 seconds to label a feature versus 10 to 20 seconds to label an instance. So that's how we're estimating the time after 10 minutes of annotation. >>: Just to be clear, instance label, that means you actually have a [inaudible]? >> Burr Settles: The whole thing. It was a fairly efficient interface where you clicked and dragged and the whole token is highlighted. >>: Does it update every time? So any label doesn't refresh with a new set of features? >> Burr Settles: No. So you do all of the annotations and then click learn, and then it retrains and comes up with a new set of features. >>: So in your simulations how many do they label in each batch? >> Burr Settles: So in these simulations, it's been long enough that I don't remember the details, I think we kind of assumed that maybe there was an instantaneous refresh or something in estimating this cost function, which in reality, if anything, is probably giving an advantage to the instance labeling as opposed to the feature labeling. >>: How do you select which feature to- >> Burr Settles: Using the heuristic, the expected uncertainty times the log frequency. >>: So out of that list of 100 you use that to select [inaudible]? >> Burr Settles: Right. So we take the top 100 and then there's the oracle, the simulated oracle, that says okay, out of these 100 these are the ones that I want to label. So this experiment and the next one, so the results are even more stark for the Craigslist classifieds. In some sense you can't really trust these because they're simulated experiments anyway, and one does not simply simulate annotators. So again, going back to, if the cost function is the number of instances maybe you get some great gains, but we want to see what this looks like in a real setting. So we ran a couple of actual user studies using that interface. So here are the results for one user annotating the research citation data set. For this we only did two conditions, so active learning for labeling full instances, which is the green line, versus active learning labeling the features, and here we see in 10 minutes of work we get significantly more accurate extractors using the active feature labeling paradigm, and the same for a different annotator on the same data set. >>: So the features have been previously defined, a set of features? >> Burr Settles: You can think of it as like a bag of words. Then there are some context features and capitalization features, but yeah, the vocabulary of the feature space has already been generated in these experiments. >>: Is this starting from zero examples or did you do any study from some number? >> Burr Settles: So if you notice, these learning curves start at two. That's because there was already two minutes' worth of work done annotating just the most frequently occurring features. That's how we started. Or maybe we started with the LDA approach for the first batch.
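For a rough sense of scale under that cost model, taking midpoints of the 2-to-4-second and 10-to-20-second figures from those user studies (the exact rates are only estimates), a 10-minute budget buys on the order of

\[
\frac{600\ \text{s}}{\approx 3\ \text{s per feature}} \approx 200\ \text{feature labels}
\qquad\text{versus}\qquad
\frac{600\ \text{s}}{\approx 15\ \text{s per instance}} \approx 40\ \text{labeled instances}.
\]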
>>: So if you already had like two thousand examples, do you think you would be able to label the remaining features in a meaningful way? Is it because you're starting off from tabula rasa that you can come up with things? >> Burr Settles: So my intuition, and in the next section I have some evidence for this, is that you get more mileage out of the feature labeling early on. So if you already have 1000 labeled instances you may not get as much out of the, actually I'm not sure that's true. But I think this initial spike that we are seeing in both of these cases partially comes from the fact that we didn't have any information at all and we are providing these very targeted, low-hanging-fruit kinds of domain knowledge about the feature-label associations. >>: How do you choose [inaudible]? >> Burr Settles: I don't know. No more than 1000 because, no more than 500. It's maybe 100. >>: So it's one per second? >> Burr Settles: It takes about 2 to 4 seconds. So another part of the reason I'm hesitating a little bit, what? >>: That's very efficient. >> Burr Settles: Yeah. The feature labeling in both this paradigm and the next section turns out to be very efficient. >>: But do they spend a lot of time like deciding which ones to [inaudible]? >> Burr Settles: This is the actual time that they spent. This isn't a simulation. So I don't know how much, I don't think we went back and looked at how much time they spent doing one thing or the other- >>: [inaudible] second? >> Burr Settles: I think it was 2 to 4 depending on the data set. One of them was two seconds per annotation; when you take the number of annotations that were done divided by the amount of time that was spent, it turned out to be two seconds I think for this data set and four seconds for the other data set, two and four for one or the other. >>: So there's a UI question, right? So in this [inaudible] show when a user is labeling features then do I run other things to label? >> Burr Settles: That shows my 100. And then you kind of have to click retrain and wait for the next batch. >>: So whatever I labeled would not appear here again because those are features that already received [inaudible]? A fresh grid of 100. >> Burr Settles: In this interface you can't go back and revisit them either. >>: So does that mean, if it's that fast, then users are probably not looking at the examples' contexts that much, right? Because to click on an example, sort of see what the system thinks, and then either change their mind or label it would probably take longer than two seconds, right? So we are probably maybe just using their own- >> Burr Settles: [indiscernible] context, that's true. >>: This is good. The features here are recognition tasks. They're not generation tasks. The users don't need to think about words. >>: Right. But you might want to see the context to know what your label- >>: That's true. >>: [inaudible]. >>: It's interesting because the clustering you have actually gives a type of context in a way, so it may be that that's actually why you can go fast, because of that clustering. >> Burr Settles: Yeah. You don't necessarily have to look at the empirical context. >>: So what is the first table you see? When you have zero labels, because there may be no word in there that you ever see in this condition.
>> Burr Settles: I don't exactly remember what the first batch is; it's either the 100 most frequently occurring features or we ran this LDA, the Latent Dirichlet Allocation, and then picked- >>: So why would you get anything that is multi- >> Burr Settles: You're hoping that by picking the most frequent or representative features that hopefully they have some predictive value. >>: With zero labels? >> Burr Settles: Even with zero labels. >>: For domains the high-frequency words are probably- >> Burr Settles: Like proceedings shows up all the time. >>: You're only seeing citations. [inaudible] if it's not generic. >> Burr Settles: Right. So everything that you're pointing out are definitely limitations of this particular implementation of the idea in general. Did you have another question? Then these are the learning curves for a third user on the Craigslist classifieds data set. So here even the active feature labeling is winning all along. So here's the sales-pitch sort of slide, that this is a new framework that we devised that outperforms passive learning with feature labels and active and passive learning with instances, but there's a dirty little secret and you guys have been hinting at it all along. There's a few different dirty little secrets. So one is that the experiments are not exactly interactive. So what happened was the users worked on a particular batch for two minutes and then they waited for the CRF to retrain, which took about a half hour or an hour, and they went ahead and did something else, and then the interface would ping them when it was ready and they would work for another two minutes and so on. So those learning curves, the reason they were kind of jagged lines that went exactly in two-minute intervals, is because they weren't exactly contiguous. So we were lying and cheating a little bit. >>: It's kind of shocking that labeling instances would perform so poorly compared to labeling features. I wonder if it's an artifact of [inaudible] CRFs being very hard to train. >> Burr Settles: I don't think so, because those results are consistent with the next section, which was using a completely different model on a completely different kind of task. >>: Even for just document classification do you think that the [inaudible] labeling features would still win? >> Burr Settles: Yes. I have evidence that it still wins. But some other limitations of this work are that the human can't volunteer the feature labels, so you're restricted to just labeling the features that are queried. So ideally we would like, even tabula rasa before anything starts, to be able to think: the word proceedings didn't show up on my grid, but I want to tell the algorithm that the word proceedings implies book titles. And we can only label features. It would be nice if we could mix the annotations of both features and instances. So because of this we are not quite A plus. We are at a B, a better-than-average kind of paradigm, but not all the way there. So now I'm going to switch gears and talk about this system called DUALIST that I built for interactive text classification, which is an attempt to address basically all of those previous limitations. >>: Before you do this, imagine I want to build a classifier for basketball. >> Burr Settles: To classify whether a document is about basketball? >>: So at first the word football is positive compared to the rest for basketball, because football and basketball may appear together or something like that.
But after I have the word basketball, that football becomes negative. So what happened is the feature can be positive first, then I put another feature and the difference between the two sets now becomes negative. Then I put another feature that explains this negative [inaudible] so it becomes positive again, and what happened is that, since I'm combining the features in a very nontrivial way, the human intuition gets completely off as to whether it's positive or negative given the other features. Now for every single feature, we can really not predict which direction it's going to push in the context of other features. Is that a problem? >> Burr Settles: It could be a problem. This gets back kind of, I think it was Jerry that asked a question about being able to revisit feature labels and changing your mind later which- >>: But this is not something [inaudible] possibly understand. But if he selects the feature football you can say it's negative, and the constraint is just that, in the margin on your unlabeled data set, your model predicts that more of these are not basketball; it's not saying anything about the weight of the thing in your model. >>: You really want to go back to the conditional though. The conditional given just football is positive. What you're trying to differentiate is the conditional given football and basketball, and that's different, which you can potentially ask about in context, but that's a different judgment than the overall [inaudible], right? >>: Right. But the information that the user provides, it turns out, how much information am I getting? At some point it becomes, so that's one approach. Then I look at another approach [inaudible] and I look at the weight of the different things, and there were things that I would have guessed were positive words, because by themselves they would be positive, and then I look at the weight in the bag of words and it's negative, and clearly the two things are carrying different intuitions. One is in the context of everything else, but the humans can only reason in very, very limited context. >> Burr Settles: So you can imagine the- >>: [inaudible] providing the wrong information just by working in context. >> Burr Settles: So you can think about the human being able to understand that if this word appears and this other word does not appear, or if they both appear, that's a different part of the if clause of the if-then rule. You can imagine the active learning submodule of this system being able to notice these discrepancies empirically, when across the cross product of this feature being present or not, so there's these four different conditions, the label distributions are different, and maybe being able to synthesize a conjunction or a disjunction kind of rule and then asking you about that rule. This is something I'm just thinking about off the top of my head now that would help. It would increase the size of the feature space and make it more complex, and the search for generating good candidates for that might be really difficult, but it's possible. It's something that you could do to help with the situation that you're describing. >>: My experience is that it works for the first ones and then- >>: So I have a different interpretation of this, which is your intuition of the sign changing is about the sign of the [inaudible] weights in the model. But if you look at the KL term that's actually in the model, they are not directly constraining the sign of any coefficient. It's a very different kind of regularization.
It turns out that will take care of the issue you mentioned. >> Burr Settles: In the case of a logistic regression with that kind of term, that's true. >>: You could have a positive feature with a negative weight in your model. >> Burr Settles: Because all of these other features are overwhelming the prediction. >>: What it means is that at some point [inaudible]. You should suggest a feature [inaudible] weight what information- >> Burr Settles: Now it could be that, just modulo everything else, it's a positive predictor, but you have so much other information, either through the labeled instances or through all of the other labeled features, that it kind of overwhelms that model. Those are the real reasons that it is positive, and this other positive label really doesn't contribute to the terms. >>: So you're saying that the user suggests a word, the generalization error increases, but then it decreases, the model gets better, even though the weight is still negative? >> Burr Settles: Right. It could be that if the weight was zero it's still positively correlated, but modulo everything else, the objective term would change hardly at all if you just took the feature out completely. But again, maybe we can take this discussion off-line, but I think Jerry was going in the right direction, where at least for the case of the discriminative model and this way of incorporating the domain knowledge into the regularization term, it kind of takes care of itself. It might be even more of a problem for this system, which is called DUALIST. So this is a text classification system. I don't know what's going on. Well, that's unfortunate. Oh, there it is. Sorry. So this is a quick demo of the system. Where did I put this? There it is. Data. I should have been more prepared. So this is a sentiment analysis example. So what the interface is doing, let's see if this messes things up to make it bigger. So what we have here on the left are actual documents, which currently, because nothing's been labeled so far, are randomly selected, and these are movie reviews from the standard sort of sentiment analysis task, and we have two labels here, positive and negative, and right now these over here are features. So this is the bag-of-words feature representation, which right now is just ranked by frequency in the corpus. But before we even do anything, I can think of the fact that words like great and terrific are predictive of positive movie reviews and maybe bad and terrible are predictive of negative movie reviews, and I can just click this button and it's already incorporated that into the model, and now it's asking me, using uncertainty sampling, about some more documents over here on the left, and it has re-ranked the words that it wants to ask about, whether these are words that are predictive of positive and negative reviews. So if you scroll down, let's see, uplifting might be associated with a positive review, Oscar, good Oscar contenders, perfectly, provocative, compelling, wonderful, superb, powerful, magnificent; on the negative side we have waste, like this was a waste of time, stupid, laughable, dumb, pointless, ridiculous. So so far we haven't even labeled any documents. Worst, sequel, and Schwarzenegger is maybe associated with- >>: Why are there fewer candidate negative words? >> Burr Settles: So this is actually kind of an artifact of the way I'm picking things, which I could do differently.
In a couple of slides I'll go into the algorithm. >>: [inaudible] select a positive box- >> Burr Settles: The selected positives are at the bottom. So you can actually go back and revisit and change your mind about labeled features in this particular case. Jean-Claude Van Damme is maybe a negative predictor, awful, subtle, anyway you get the idea. So far we haven't even labeled any instances. And we can look at, let's pause and look at the predictions. So this is again a research-grade interface, but over here this is the predicted label on the far left, this is the confidence of the model in that label, and then this is the actual label. So you're starting to see there's a few mistakes, like we are predicting this document, this review of Species 2, we think is negative when in fact it was positive, but in general you'll see there's a lot of parity between the predictions and the actual labels already, even though we haven't labeled any instances and we haven't labeled that many features. And then we can go through- >>: [inaudible]? >> Burr Settles: No we don't, because there's sort of two modes. There's explore mode, which is what I'm using here, and you don't need labels to be able to use the explore mode. >>: What I mean is you have all those yellow things highlighted. That's for our human benefit but there are no numbers on there. >> Burr Settles: Right. If I ran DUALIST in the experiment mode, which requires labels, then we could get an actual accuracy estimation. >>: [inaudible]. >> Burr Settles: Those are the true labels. >>: So this is above 90 percent accurate? >> Burr Settles: Maybe. Actually if you want to bear with me a minute I can very happily, using the command line, kind of figure out what the confusion matrix is. >>: I think it's about 90 percent. >> Burr Settles: It's probably, I'm not sure it's that high but- >>: [inaudible]. >> Burr Settles: That's an artifact of the underlying model, which is naive Bayes. So let's jump back into the talk and I'll talk about how this actually works. So it's a multinomial naive Bayes because it's been known to work well for a lot of text problems, and it's basically linear training time; I'll skip over the actual modeling. So here's a cartoon illustration of how the learning algorithm works. So we have our vocabulary of words, we've got two copies of each of these features, one conditioned on the label being negative, one conditioned on it being positive, and then we start with your typical symmetric Dirichlet prior, or Laplace pseudo-counts of one, just for smoothing. So if we have a labeled feature, so I can inject some knowledge by just typing in a word under a particular label, what we do is we just pretend like we hallucinated five examples where the word bad occurred in a negatively labeled document and fantastic in a positively labeled document. So we update these counts with the labeled features like so, and then if we have labeled documents we can do a sort of variant of maximum likelihood estimation. So we have all these negative documents, we can update the counts for the words that appear in those documents on top of the priors that came from the labeled features, and the same thing for the positive documents. >>: What's the total number of documents? >> Burr Settles: It depends on the data set, and for what we were just doing I think it's 2000 documents. It's not that huge, but this is still linear training time. I've used it on- >>: That means five comes from [inaudible]? >> Burr Settles: You mean the pseudo-count?
So it turns out it's not that sensitive. I ran some experiments for different values of this, and it turns out, invariant of how big the actual data set is, it doesn't matter that much what this pseudo-count is as long as it is less than 100. For some reason, I don't fully understand theoretically why, but if it's less than 100 it seems to work pretty well. And then from here we can actually do one step, not to full convergence but just for the sake of speed, one or two or three or some finite number of steps of the EM algorithm, where we can probabilistically label all the unlabeled documents, then fractionally update the pseudo-counts depending on the probabilistic labelings of the unlabeled documents, and then re-estimate the parameters based on that, and that is our final classifier. And then using that classifier we can do the active learning steps by scanning through all the unlabeled documents, selecting the most uncertain ones, and then also scanning through all the features and selecting the ones with the highest expected information gain, so the ones that have the largest difference between the expected uncertainty and the expected uncertainty of the documents that have that feature present. And then what I'm doing is ranking the top 100 or 200 of those and then organizing them into the columns by the model's expectation that this one occurs more frequently with this versus that label, and so the reason there was this imbalance after a couple of iterations was just because the most informative features tended to be more correlated with the positive label, but you could also choose a slightly different ranking algorithm that would keep those more balanced. So I'll discuss some more user experiments. >>: [inaudible]? >> Burr Settles: Just to keep things fast and tractable and not necessarily run to convergence. You could also do two, three, four. But the important thing is that it's a finite number of steps. Another thing is, I think empirically I played around with this, and because the model is so strongly biased when you're labeling features, it can really run off the rails if you do more than a couple of steps. So there's two reasons: one for speed and the other just to keep it from believing its own press too much. >>: [inaudible] train a logistic regression on your expected counts to see if the probabilities [inaudible]? >> Burr Settles: I'm not sure what you mean. >>: So you do an [inaudible] step. So for every document you have a fractional label [inaudible] so you could use that as your training data for your logistic regression. >> Burr Settles: Oh, I see what you're saying. So kind of using the naive Bayes as a proxy. So I suspect that probably won't work that well because the posterior distributions from the naive Bayes classifier are still overly polarized. So Patrice was pointing out that those confidences looked a little scary. >>: So that's a problem I was trying to address, is that if I have so many positive words, so by retraining with logistic regression you could avoid the double counting that [inaudible] makes your probability crazy. >> Burr Settles: Right, right, right. I think what might work better is actually trying some other approach: instead of naive Bayes, just actually using logistic regression, and then incorporating the feature labels either using generalized expectation, like a regularization approach, or something that is a little more computationally efficient where you can do incremental updates. That would be better.
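Putting the pieces just described together, here is a minimal sketch of a DUALIST-style trainer; the function and variable names, the default pseudo-count of 5, and the single EM step are illustrative choices following the description above, not the actual implementation.

```python
import math

def train_dualist_nb(vocab, labels, labeled_feats, labeled_docs, unlabeled_docs,
                     alpha=5.0, em_steps=1):
    # Base pseudo-counts: Laplace smoothing of 1, plus "hallucinated" counts of alpha
    # for each human-labeled feature, plus ordinary counts from labeled documents.
    base = {y: {w: 1.0 for w in vocab} for y in labels}
    base_prior = {y: 1.0 for y in labels}
    for w, y in labeled_feats:                 # e.g. ("bad", "neg"), ("fantastic", "pos")
        base[y][w] += alpha
    for words, y in labeled_docs:
        base_prior[y] += 1.0
        for w in words:
            base[y][w] += 1.0

    def make_classifier(counts, prior):
        totals = {y: sum(counts[y].values()) for y in labels}
        def posterior(words):                  # multinomial naive Bayes posterior
            log_p = {y: math.log(prior[y]) for y in labels}
            for y in labels:
                for w in words:
                    if w in counts[y]:
                        log_p[y] += math.log(counts[y][w] / totals[y])
            m = max(log_p.values())
            z = sum(math.exp(v - m) for v in log_p.values())
            return {y: math.exp(log_p[y] - m) / z for y in labels}
        return posterior

    classify = make_classifier(base, base_prior)
    for _ in range(em_steps):                  # a fixed, small number of EM steps
        counts = {y: dict(base[y]) for y in labels}
        prior = dict(base_prior)
        for words in unlabeled_docs:           # E-step: probabilistically label the pool
            post = classify(words)
            for y in labels:
                prior[y] += post[y]
                for w in words:
                    if w in counts[y]:
                        counts[y][w] += post[y]    # fractional counts on top of the base
        classify = make_classifier(counts, prior)  # M-step: re-estimate the parameters
    return classify

def most_uncertain(classify, unlabeled_docs, k=10):
    # Document queries: the documents whose predicted label distribution has highest entropy.
    def entropy(words):
        return -sum(p * math.log(p) for p in classify(words).values() if p > 0)
    return sorted(unlabeled_docs, key=entropy, reverse=True)[:k]
```

With, say, `labeled_feats = [("bad", "neg"), ("terrible", "neg"), ("great", "pos"), ("terrific", "pos")]` and no labeled documents at all, a sketch like this already yields a usable classifier over the unlabeled pool, which is the zero-labeled-instance behavior shown in the demo.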
But this work is maybe three years ago and I haven't really been working on it since then. That's a direction I wanted to go at the time for sure. >>: Very quick comment on this. Your [inaudible] prior with [inaudible] of number five, is that equivalent to having a pseudo-document with [inaudible] wording [inaudible]? >> Burr Settles: No. It's actually, actually I guess it is. Or five documents that have that word once. That's true, because it's a multinomial it's the same thing as, yeah. >>: Is that the same thing? Five positives with four numbers- >>: You create five pseudo-documents. Each one contains only one word and so forth, five times. >> Burr Settles: That should be the same as one document that has nothing but the same word five times in the case of the multinomial event model. If it's a binomial event model then it's different. >>: So I think the most [inaudible] in very short [inaudible] you can get absolute [inaudible]. So this value [inaudible] the thing I'm wondering about is, does this get you a reasonable accuracy superfast, but if you were to compare it with just building a model with labels in the traditional way how far will [inaudible]? >> Burr Settles: Hold onto your seats. So here this is a more exhaustive set of experiments than the previous paper. So here we used three different data sets with five different annotators labeling those data sets in these three different experimental conditions, and the order of the conditions was randomized to reduce some kind of presentation bias or learning about the problem. So we used DUALIST in the flavor that I just showed you, versus active learning on documents only, so this is no feature interactions at all, versus passive labeling of documents only, so we randomly selected the documents and documents are all that you can label. And we ran these in six-minute trials. To keep things consistent across all conditions there was a fixed 90 percent unlabeled pool, because they could actually label things differently than the gold standard, but we pulled out the unlabeled pool at 90 percent, and then a fixed 10 percent test set. So these are the learning curves for 4 distinct users on the same problem. So this is the classification of university webpages into faculty, student, research project, or course pages. So there are four different labels. The red line is DUALIST; the green line is active learning but documents only. So the first thing to notice is that both of the active learning sort of paradigms have better learning curves than passive, and the cost function here, the X axis, is actual wall clock time with no pauses in between the retrainings; it's an actual contiguous six minutes of work. The dotted line at the top is the accuracy of the classifier if we trained on the full 90 percent labeled training set. So in all of these cases we're getting within 90 percent of the upper bound accuracy within six minutes with DUALIST. >>: So what's the X axis on the top? >> Burr Settles: It's in seconds. So this is 360 seconds. >>: So what's the [inaudible] bottom and the top? >>: Four different users. >> Burr Settles: These are learning curves for four distinct users on the same problem. >>: What was their interface for the active labeling and passive labeling? >> Burr Settles: So it was the same as DUALIST, it just didn't have the columns on the right for labeling the features. >>: So you would hire the top right user? >> Burr Settles: This person seemed to get it right and just really nailed it out of the park.
>>: So at the end of six minutes how many documents did they manage to label, slash features, right? And you said that dashed line is the upper bound trained on the 90 percent, which would be like 1,000-something. >> Burr Settles: So in this particular one I think there were 1,000 documents, there are 4,000 total documents. >>: 3,000 documents. >> Burr Settles: So the upper bound is trained on something like 3,000-plus labeled documents. >>: But in six minutes how many documents can people label? >> Burr Settles: That's a good question. I think I discuss that in the paper but I don't have it on a slide. >>: [inaudible] less than 100. >> Burr Settles: It's not that many. So those are the results for one data set. We get similar results for this science data set, one of the 20 Newsgroups subcategories with four different labels. I don't remember exactly what they are, but again we can get within about 90 percent accuracy of the upper bound, and the gains are even more stark. >>: Is the upper bound still [inaudible]? >> Burr Settles: It's the same model. It's just trained in a traditional supervised way. >>: So you said you're not testing against some ground truth, you're testing their own labels? >> Burr Settles: No. The test set is consistent for every line you see, and it's the ground-truth gold-standard label that came with the distribution of this benchmark data set. But it is possible that a user in one of these conditions gave a label that disagreed with the ground truth. >>: What percentage of the data is actually about science? >> Burr Settles: It's all about science, there are just different fields of science. So one is like astronomy and one is, I don't remember what the four categories are. >>: So you went with the most prevalent categories of science? [inaudible] the most common? >> Burr Settles: I think they are somewhat balanced in this particular data set. >>: [inaudible] labels and labeling documents and features? >> Burr Settles: They can inject features, they can label... >>: [inaudible] how frequently they use the two? >> Burr Settles: Right. So in general, I think I have a slide in here somewhere that visually shows it. People tended to start out, actually let me see if I can just find that slide, here it is. So at the beginning it's fairly evenly distributed. This is the amount of time, or the number of actions, that they did in, say, the first minute, and it's pretty evenly distributed between labeling documents, labeling features, and volunteering features, so just typing in brand-new features. And over time they do less volunteering and less feature labeling and spend more time labeling documents. >>: [inaudible]? >> Burr Settles: Probably because the low-hanging fruit has been picked. >>: [inaudible] features work may be as good after a while? >> Burr Settles: After a while the features stop being that interpretable, maybe. >>: It would be interesting to see, if you just have people label documents and then show them features, whether they get added value at some point. Is it only valuable in the beginning or is it also valuable [inaudible]? >> Burr Settles: Yeah. Somebody asked a similar question about the previous section, and it's a good question. I'm not really sure. I think there would still be some utility but it would be less. I mean, there are diminishing returns in general with all of these things. >>: Does the user have any sense of progress? Can I tell whether adding a feature made it better or worse? Do I have any idea? >> Burr Settles: With this interface, no.
So, some of the things that you've built into the ICE system are really good, like the visualization of the spectrum of predictions and whether or not they're actually positive or negative, where you can visually see where the false positives and false negatives are and then go in and try to understand why. None of that is part of this current interface. >>: Why would anybody volunteer a feature? What's my motivation for doing that? Why would I do it in the first place? >>: They don't know the impact, you mean? >> Burr Settles: It's a type of domain knowledge that you have, that you come to the table kind of knowing about the problem. >>: [inaudible]. I mean, I can give a label or I can give a feature or I can type something in, and I'm not really getting any direct feedback of "that was great" or "that wasn't so great," so I have to decide on a policy without really... >>: I think the question is how do you make the teacher better? >> Burr Settles: How do you inform the teacher more? >>: So another way to interpret the curve is: the best is 98 percent. You get to 90 percent pretty quickly, which is five times the error rate, [inaudible] is not that great. You can say oh, it's 90 percent accurate... >> Burr Settles: It's five times the error rate in six minutes as opposed to two weeks. >>: I mean, there are many problems for which [inaudible] 98 percent, one is usable and the other isn't. >>: I think another way to ask the question is: you did this for six minutes. If you had another 10 minutes, what would you do? Because you're only running this for six minutes. What would you do to get it better in the next 10 minutes? >> Burr Settles: Another artifact here might be that you're a little bit limited by the classifier. So the next slide I was going to show was... >>: So is the 98 percent the human performance versus [inaudible]? >> Burr Settles: The 98 percent, this upper bound, is trained using the gold-standard labels for everything. So the results here were less amazing. DUALIST still seems to be consistently better, and this user apparently was doing something right. But this is a trickier problem, because [indiscernible] analysis is tougher than some of these other problems. The other data sets were in some sense a little more clustery, so the generative naive Bayes model captured them better, whereas you've got much more subtle, non-conditionally-independent features going on with this particular data set. So it could be that swapping out the model for a logistic regression classifier would actually do much better here. And, like what you were getting at, facilitating some information back to the human about what was actually learned and what went into making different decisions would be helpful for deciding, oh, maybe I'm going to undo this feature label, this piece of domain knowledge. So it's almost 5 o'clock, so I could either stop now or briefly power through a couple of other salient points. How do people feel? >>: Power through. >> Burr Settles: Power through. Okay. So here's an interesting failure case: basically a user who didn't label any features until right there, and then the accuracy started to shoot up. So basically on the movie reviews data set there was almost no difference in behavior between the three conditions, so we didn't really see any gains.
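For the logistic-regression swap mentioned above, one way to incorporate labeled features is a generalized-expectation-style regularizer, roughly along the lines of Druck, Mann, and McCallum's formulation as I recall it; treat the exact form here as an assumption rather than something from the talk:

$$\mathcal{O}(\theta) \;=\; \sum_{i \in \mathcal{L}} \log p_\theta(y_i \mid x_i) \;-\; \lambda \sum_{k \in \mathcal{K}} \mathrm{KL}\!\left( \hat{p}_k \,\big\|\, \tilde{p}_{\theta,k} \right) \;-\; \frac{\lVert \theta \rVert^2}{2\sigma^2},$$

where $\mathcal{K}$ is the set of user-labeled features, $\hat{p}_k$ is a reference label distribution for feature $k$ (most of its mass on the class the user assigned), and $\tilde{p}_{\theta,k}$ is the model's average predicted label distribution over unlabeled documents containing feature $k$. The appeal is that feature labels act as soft constraints on the model's expectations rather than as hard pseudo-counts, which may avoid the over-polarized posteriors discussed earlier.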
Here was another one where DUALIST started out looking good and then sort of flatlined and went crazy, and it's because, this was for the WebKB corpus, and for this particular user, after two minutes or so of labeling, the features being queried just kind of stopped making sense, and the only column that made sense was the course label. So they spent all of their time in feature-labeling land labeling those features and not any of the others, and there wasn't a good mechanism in the training algorithm at the time to account for that imbalance in the feature labels. So it started running off the rails, and active learning with documents only turned out to be better. This is an open-source sort of thing you can download and play around with. So these are some of the strengths, but there's this question of why it's not always better, and you guys have been getting at it. So we're at an A minus; we still haven't gotten all the way to an A plus. So can we get to an A plus? I'll just power through this. So Jerry and I, three years or so ago, used the data for a few different things, but one of them was to try to replicate these results using regular people, or Turkers, instead of, I have a confession, the users in the previous studies, who were all graduate students in natural language processing and machine learning, so we understood the problem a little more than people in general would. So the idea is, can we replicate the user results? And maybe not. We had 32 or 33 different Turkers in these different conditions try to train things, and basically the learning curves weren't statistically significantly different. However, if you look at the distribution of final accuracies of the models, after, I think these were 10-minute experiments, clearly some of the users who were given the DUALIST interface were doing better than the others. So the question is, what were they doing that helped? So let's review the different things you can do: you can label documents, you can label feature queries, and you can volunteer features. These are the three first-class activities, and it turns out that volunteering features, that behavior, seems to be really predictive of the final accuracy, at least under this really short 10-minute restriction. And some evidence for this: we went through, and out of the 30-something individuals in that group we found three behavioral subgroups. So 11 of the 33 didn't volunteer any features at all; they only labeled documents, and they may have labeled some of the queried features. Then there was a small group that volunteered a lot of features, and another third or so of them volunteered a few features. I forget what the cutoff was. >>: If they typed a word that's already in the list, did you count it as a volunteered feature or a labeled feature, even though it's already in the list? >> Burr Settles: If it's already on the list, basically if it's in the list but not labeled in the list, then that counts as volunteering it. If it's already in the list and labeled, then it's sort of a duplicate and we throw it away. Does that make sense? >>: Maybe when you go home today you can also [inaudible] faster because you don't want to [inaudible]. >> Burr Settles: You don't have to scan through the list. >>: There's also the number of features created, right? >> Burr Settles: Right. >>: Let me just clarify.
So there was this word like "excellent" in the list, and I didn't read it really carefully, but I thought of "excellent." >> Burr Settles: And you typed in "excellent"? >>: Yes. >> Burr Settles: That would count as volunteering a feature, even if the system was already asking about that feature. That's not the mechanism through which you provided the information. And so then if we look... >>: [inaudible]. These are the only modes that can be in active learning. That was the point I was trying to make, but you don't have enough time. >> Burr Settles: Anyway, these are the learning curves for the people in those conditions, and here we see some statistically significant gains; and this is a linear regression to predict final classifier accuracy as a function of the number of times people did these different actions, and the only one to have a statistically significant effect is whether or not they volunteered words. >>: Can you show [inaudible]? >> Burr Settles: On this? >>: Yeah. >>: I think increase is a good one. It could be explained by something other than the group of things you had listed there. It could be the number of features that they labeled, and by typing they could label more. >> Burr Settles: So that's a possibility. It may also be... >>: [inaudible] list there for your task. >> Burr Settles: I kind of suspected that didn't happen very often, but it may have. Another explanation is that there's a hidden variable here: maybe they're just good labelers, and the good labelers happen to also want to take advantage of this behavior. So it's correlated but not [indiscernible]. >>: I doubt it's the difference in the number of labels, because everything is on a time axis there, and it would be surprising if Turkers were all that fast as typists but really bad list scanners. That would be weird. It might be faster to type than to scan the list, but it wouldn't be consistent with any other scanning-type study. >>: They probably scan the list and select all the ones that make sense there, and this is measuring the stuff that they're adding to it, right? >> Burr Settles: Well, all of the actions are included in this cost function. >>: I think there is a difference, though, because at some point you're not finding new words and you have to scroll and scroll before you find all of them, as opposed to when you type, you type one word instead of reading 100. >>: There's a real informational difference, but I don't want to take his time. >> Burr Settles: Those were actually the main points I wanted to make. Maybe we can still get to an A plus, but there's a lot to learn, and I think it's inherently interdisciplinary. It takes research in machine learning to figure out algorithms that can pose questions in these different mixed-initiative modalities and make use of the human's domain knowledge, which is the next bullet point. In human-computer interaction there's interface design, which I haven't done as much of and it sounds like you guys have been doing a lot more: helping the human actually understand what's going on in the model, what the machine has learned. Most of the work I've done is facilitating the communication from the human into the algorithm in these different ways, but not so much understanding what the algorithm is learning and being able to go back and debug the resulting artifacts.
And then there's also this human behavior aspect: what kinds of things do people tend to do, how do those impact efficiency, and are there ways to design the interfaces to encourage those activities? Because if you end up with these richer, multimodal, mixed-initiative interfaces, then instead of one thing the human can do ten different things, so how do we optimize getting the human to do the right things at the right times? So that's kind of all I have at this point. Only eight minutes over.