>> Rich Caruana: It is my pleasure to introduce Been Kim and have her here for a talk today. Let's see, you graduated from MIT in the last year or so? >> Been Kim: Yeah, this year, this summer. >> Rich Caruana: Okay, great, and you have been at AI2, the Allen Institute for Artificial Intelligence, for what, 6 months now? >> Been Kim: 2 months. >> Rich Caruana: 2 months, okay, so you are very new to the Seattle area. The weather will get better in 6 months. Been Kim has been doing research in interactive machine learning and she is going to talk about that today, so welcome. >> Been Kim: Thanks for the introduction and thanks for inviting me to speak at MSR here today. Today I am going to talk about my PhD work on interactive and interpretable machine learning for human-machine collaboration. A quick vision of my research is to harness the relative strengths of humans and machine learning models. Humans have years of accumulated domain expert knowledge that machine learning models may not have, whereas machine learning models are able to perform complex computations efficiently and precisely. The goal of this work is to have them work together in order to help humans make better decisions. If we want to achieve that, we need machine learning models that can intuitively explain the results of machine learning methods to humans who may not be machine learning experts. We also need machine learning models that can incorporate human domain experts' knowledge back into the system in order to leverage it. So my research objective is developing machine learning models, inspired by how humans think, that could first infer the decisions of humans. We first need to know what decisions humans have made in order to maybe make a better suggestion. So I built a model that could infer humans' decisions from human team planning conversations for disaster response. The next part is about building machine learning models that can make sense to humans: clustering methods that could intuitively explain to humans what the results are trying to say. And finally, to close this feedback loop, I built a machine learning system that could incorporate human domain expert knowledge back into the machine learning system, implemented and verified in a real-world domain, computer science education. I will talk about that today. So among these three portions I am going to focus on the last 2 sections of my thesis: make sense to humans and interact with humans. So let's jump right into building machine learning models that could communicate intuitively to humans about their machine learning results. When you think about building machine learning models that could intuitively explain and communicate with humans, you have to maybe first think about how humans think. There has been rich cognitive research that shows that the way that humans make these technical decisions is based on [indiscernible] based reasoning. Particularly, if you are a skilled firefighter, the way that you figure out what you are going to do with a new incident, maybe a new fire, is you think about all the previous incidents that you have dealt with, think about the closest example to your new case, and apply a modified solution to this new case. So I argued that if we want to build a machine that could support better human decision making we need to represent the information in the way that humans think. >>: My intuition is exactly the opposite.
So if I want to teach another human how to buy a car I could go and point to the color of that car, or I could teach by feature, which is that you should look at the reliability record, you should [indiscernible], those are features. So I teach by feature, I don't teach by example. >> Been Kim: So imagine a case where it's not a car anymore. Imagine a data point where you have thousands of features. Can you teach those thousand features and enumerate all the features to humans? It might be difficult, right. >>: [indiscernible]. >> Been Kim: I think that is a really good point. In fact the work that I do here is roughly what you are saying, pursuing sparsity. So an example with the features that are important, and those are the ones that I point out along with the example. >>: Oh I see, so you teach an individual feature via an example. >> Been Kim: Yes. >>: So you still teach by feature, but you convey the feature by example? >> Been Kim: So an example with the key features that really matter. >>: I teach with an example that supports this. If you want to teach a pigeon how to add, the first thing you do is put a piece of tape in the cage and you reward it for being in the right half. So you are teaching by example, but being in the right half of the cage is a feature. Once you have taught that you move the tape, then you reward for pecking on the wall, and in some way each of these is a feature, but each of these features is taught by example. >> Been Kim: Right, that's true, a very similar idea, exactly. But this idea of example-based reasoning is really motivated by people studying humans. When they study people making important, critical decisions, the way that humans' brains work is by example, so the idea is to leverage that sort of rich cognitive research as a start. And leveraging examples and the intuitiveness of examples has been studied; it's not new. In classical AI it is called case-based reasoning. Case-based reasoning has been applied to various applications successfully, but case-based reasoning always requires labels. So if you are trying to apply case-based reasoning to fix your car, for example, you need to know all the previous cases where you tried to fix your car in order to decide what solution is appropriate to fix your car right now. It also doesn't scale very well to complex problems. I think, going back to your question, if fixing your car requires you to read pages and pages of documents it starts to lose that intuitiveness. And of course it's not designed to leverage global patterns of the data. We do have machine learning models that can leverage global patterns of data, and particularly the related work is interpretable models. There are decision trees or sparse linear classifiers to do this. And what I mean, which I just said, is if you give me high-dimensional data points then I am going to select the subset of the features that are important and give it to you so that we can scale to complex problems. My work is about combining these two: maintaining the intuitiveness of examples while leveraging global patterns of the data using machine learning, but pursuing sparsity so that we can scale to large, complex problems. So our approach, we call it the Bayesian Case Model, and formally we leverage the power of examples, which we call prototypes, and subspaces, which are the features that are important.
This is just a way to say, "We have this complicated, important thing that we would like to explain to you, and we are going to explain it to you using examples." Formally we combine Bayesian generative models with case-based reasoning. So let me explain to you how BCM works using examples. So this is a set of recipe data where a data point consists of a list of ingredients to cook the food. It's not about the instructions, just the list of the ingredients that you need to cook it. If you look at it, can you cluster this data set into, say, 3 clusters? >>: Crepes. >> Been Kim: Crepes, yeah. >>: Meat. >> Been Kim: Meat? >>: Mexican meat. >> Been Kim: Mexican, yes, and there is one more. >>: Strawberry. >> Been Kim: Yes, so strawberry dessert things. So those are three clusters. So one way to cluster this data set is a Mexican food cluster, a crepes cluster and a dessert cluster. BCM is a clustering method which clusters this data in an unsupervised manner and at the same time tries to learn the best way to explain each cluster. So in this case someone said Mexican food, and we are lucky in this case because we somehow had a name to describe this cluster, some extra abstract, higher-level idea. But if you work with real data, often when we cluster data we don't have a convenient name to describe a cluster. So instead I can explain this cluster to you like this: the first cluster is like taco. If you have eaten tacos you know what's in there, but really the important ingredients that define this cluster as a group are salsa, sour cream and avocado. The second cluster is like a basic crepe recipe where the important ingredients are flour and egg. The last cluster is chocolate berry tart. The important ingredients are chocolate and strawberry. >>: So here is my comment on these two slides. >> Been Kim: Yeah? >>: You should do a user study where you can see whether people can understand this information faster than in the previous slide. In this slide I see the clusters much more clearly. I understand them much faster and I didn't have to read anything. >> Been Kim: I have a human experiment that –. >>: In the previous slide I couldn't easily see the clusters, but once you put things together like this I can see the clusters. >> Been Kim: So the entire bird's-eye view of all the data points, that's what you are saying? >>: This is better than the previous slide. >> Been Kim: The previous slide? >>: The one before. >> Been Kim: The one before? >>: Yes, that's hard. >>: This is not good, this is a jumble. This is good and I see where the clusters are. >> Been Kim: Right, right. >>: The next one is not as good as the previous one for me. >> Been Kim: Oh, interesting, so you would like to see all the data points? >>: If I see cluster B I know what it is, right. I don't have to read anything. I see them and instantly know what it is, and I see cluster A the same way without looking at the ingredients, without reading anything. I would rather see 5 examples of cluster B than 1 example and an explanation. >> Been Kim: Oh, I see what you mean. You want to see multiple examples. >>: So I wonder what most people are like. Are they visual like this or are they methodical like you have on the next slide? >> Been Kim: I see, so this one is a case where I randomly selected examples, and you are saying that multiple examples are better than just one example and a subspace? >>: [indiscernible].
>>: My answer is counting things I can see, like the clouds and so on, but somehow I want something I can just spot rather than read through. >> Been Kim: So I have a human –. >>: But recipes would be interesting to see whether it is really true. I don't know if it is actually true. Is it just me, or do some people see it this way and other people really prefer to just see a few ingredients and read them rather than see these things? >>: That's interesting because I prefer to read the ingredients. >>: The ingredients? >>: Yeah. >> Been Kim: I talk about this in the later slides where –. >>: [inaudible]. >> Been Kim: I talk about people who prefer a different way, and depending on what domain experts you are working with people have different preferences, and the interactive system later in my talk addresses the different ways. Not exactly in the way that you said, example versus distribution of the data, but how can we leverage what the domain expert prefers and bring it back to the machine learning system? But to answer your question briefly, and I will show you in the later slides, we have a human subject experiment to compare this sort of representation versus, not exactly like this, but non-example-based methods. So instead of this I show you a distribution of the important features of each cluster. I will show you that in a bit. So BCM performs joint inference on cluster labels and explanations, which are provided in the form of prototypes and subspaces, and formally we define a prototype to be the quintessential observation that best represents the cluster and a subspace to be the set of important features characterizing the cluster. So there are two parts in BCM: clustering and learning explanations. Remember that in the inference these two happen simultaneously, but for the sake of explaining I will divide this into two parts. So for the first part, clustering, we leverage a widely used model called the admixture model. This is something that LDA uses as well. What the admixture model does, and it also allows flexibility in this way, is assign multiple cluster labels to a data point as opposed to assigning one cluster label to one data point. So take this sort of new data point, say a Mexican-inspired crepe, and I talked about this in front of my friend from France and he got a little offended, like, "What are you talking about, a Mexican-inspired crepe?" But if you have such a recipe, then what the admixture model would do is assign the ingredients that belong to the Mexican cluster to cluster A and the other ingredients that belong to the crepe cluster to cluster B. So you can think of it as building a vector Z where it has got some A's and some B's. You can think about another example where we have a crepe that has chocolate and strawberry. So some of the ingredients, chocolate and strawberry, will belong to cluster C while the other ingredients, like the crepe base, belong to cluster B. You can also think about normalizing this vector to get a cluster distribution vector. That's the pi vector; it's a common representation in admixture models, the pi vector right here. And if you combine that with a supervised learning method you can use it to evaluate your clustering performance; that's exactly how it is done in the LDA paper and that's exactly how we evaluate our model. Thanks to a nice conjugate prior that we have, you have some control over how you want the cluster labels to be distributed using the hyperparameter alpha.
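To make the Z-to-pi step and the downstream evaluation concrete, here is a minimal sketch in Python. The cluster assignments, labels, and the use of scikit-learn's LinearSVC are made-up stand-ins for illustration; this is not code from the talk, just the normalize-then-classify protocol it describes.

```python
# Toy sketch: per-ingredient cluster assignments (z) -> normalized cluster
# distribution (pi), then pi vectors as features for a classifier, as in
# the LDA-style evaluation described above. All data here are made up.
from collections import Counter

import numpy as np
from sklearn.svm import LinearSVC

K = 3  # number of clusters (A, B, C -> 0, 1, 2)

# Hypothetical per-ingredient assignments for two recipes, e.g. a
# "Mexican-inspired crepe" mixes cluster 0 (Mexican) and cluster 1 (crepe).
z_vectors = [
    [0, 0, 0, 1, 1],         # mostly Mexican ingredients, some crepe
    [1, 1, 2, 2, 2, 2],      # crepe base plus dessert (chocolate/strawberry)
]

def pi_from_z(z, K):
    """Normalize cluster-assignment counts into a distribution over clusters."""
    counts = Counter(z)
    return np.array([counts[k] for k in range(K)], dtype=float) / len(z)

pi = np.vstack([pi_from_z(z, K) for z in z_vectors])
print(pi)  # each row sums to 1: that recipe's cluster-distribution vector

# Downstream evaluation (sketch): use the pi vectors as features for a
# supervised classifier with made-up labels, mirroring the LDA-paper protocol.
labels = [0, 1]
clf = LinearSVC().fit(pi, labels)
print(clf.predict(pi))
```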
The next part is the learning-explanations part. We have prototypes and subspaces as the way to explain clustering results. The first one, the prototype, in the generative story (remember this is the generative story, not the inference story) is simply a uniform distribution over the data points that you gave me. And this is the key reason why BCM is intuitive: if you are a doctor and you gave me patient data, patients that you have dealt with in the last decade maybe, then I will cluster them, and the way that I will explain each cluster is using one of the patients that you dealt with. Subspaces are simply binary variables, one for important features and 0 for unimportant features. And together with the prototype, the subspaces form this function g that feeds into sampling the phi latent variable, which describes the characteristics of the clusters. So what is this function g? Well, function g is really simply a similarity measure, and you can use any similarity measure that you want, but this is the one that we used. It looks complicated, but it is actually the simplest possible similarity measure that you can think of. If you have a feature that is in the subspace and you have the same feature value as your prototype, then I am going to score you higher than other data points that don't have that feature. So if the prototype is a taco and I am a taco salad, I will share avocado and sour cream with my prototype, so I will score higher than, say, a chocolate berry tart. >>: So going back to your N part, does that mean that you have a closed-universe assumption, that we cannot accept new items in this model? >> Been Kim: So I will not create a new item. >>: You will not generate a new item in this model? >> Been Kim: Yeah, and if I did, the assumption here is that it would lose the intuitiveness. You give me patient data and I generate some fictional patient that you never dealt with. >>: So what is being generated in this model? I am a bit confused now. >> Been Kim: So you are familiar with the LDA generative model, right. >>: [inaudible]. >> Been Kim: So this is just the generative story of how I –. >>: But you are not sampling new documents, right? You are not sampling new patients? >> Been Kim: I am not sampling new patients, right. >>: Or are you saying you could generate new patients, but you just take another set as the set for the purpose of getting prototypes? Like, do you have that N, and that N is limiting how many items you can have in the system? >> Been Kim: I see, uh-huh, uh-huh. >>: Or is that not true? Can you actually, in your model, produce item N plus 1? >> Been Kim: So the answer to that is yes and no. I guess I can; it's a generative model so you can generate a fictional prototype, but one of the points of this model is that doing so would decrease how intuitive it could be. >>: Because in the end you use the generative model for interpreting the parameters, right? You are going to look at all the parameters and tell a story about it instead of fitting new items. That's the purpose of the model, right? >> Been Kim: Right. >>: A comment on that. So generating examples is really dangerous in a complex domain because you will probably get some detail wrong. I mean in healthcare, for example, if you were to generate a patient and you got something wrong, like say you had a male that was pregnant.
You know, unless you had a sort of perfect generating structure you might make those kinds of mistakes, and then experts will just immediately ignore everything you do because it is sort of nonsensical. >>: Sure, in some ways –. >>: So using real examples is sort of safe because they must exist. >>: It's the same for the LDA story; although it's generative, nobody in their right mind would actually generate new documents with LDA. It's a matter of looking at the topics and then imagining the story as [inaudible]. >> Been Kim: Exactly, a really good example. So, the g function is a similarity measure and you can use other similarity measures, such as loss functions. It's a pretty general model. You can pick the similarity measure that fits your application. So when I am working with interpretable models I can't just show you that, "Here, this is a model; I clustered well, take it." That's not how it works. We have to answer all these questions in order to convince you that I built something interpretable. First is a sanity check: you said that you learned prototypes and subspaces; does it learn intuitive prototypes and subspaces that make sense? Second, interpretability is great, but we don't really want to sacrifice performance for interpretability; does it maintain performance? And lastly, if those two are true, then can this really improve human understanding of the clustering results? So I will present BCM's results by answering each of these questions. We ran BCM on two publicly available data sets. The first one is a recipe data set. This is from a computer cooking contest. One data point is a list of ingredients. The first one is soy sauce, chicken, sugar, sesame seeds and rice. So it's a list of ingredients as Boolean features, making an Asian-inspired chicken dish. Those are our data points. When I ran BCM on this recipe data we learned 4 clusters. The first one looks a lot like a pasta cluster; it selects herbs and tomato in pasta as its prototype, and it learns oil, pasta, pepper and tomato as its important set of features. The second cluster, which looks a lot like a chili cluster, selects a generic chili recipe as its prototype and learns beer, chili powder and tomato as its subspace. And if you are like one of my committee members who, when I showed this to her, said, "Been, something is weird, beer should not be in chili. You have got to double check your model." If you are wondering about the same thing, I highly recommend that you start putting beer in your chili. It really makes a lot of things better. You can put it in your stir fry. Beer makes a lot of things better, like triple and quadruple better. We also have a brownie cluster, and it selected a microwave brownie as its prototype, selecting baking powder and chocolate as its subspace. >>: So you are showing a subset of the words from that particular recipe. >> Been Kim: That's correct. >>: Then you just highlight the ones that are the most likely. >> Been Kim: The important ingredients that describe the cluster. >>: Is there any hierarchy to the clustering or is it one level? >> Been Kim: It is one level right now, but I think we can extend it to a hierarchical clustering. >>: Can I say something about your inference algorithm? >> Been Kim: How do I perform it? >>: Yes. >> Been Kim: Yeah, it's coming, but I perform Gibbs sampling. >>: You do have discrete variables here, right? >> Been Kim: I do. >>: You said it's easy to do sampling; does it behave nicely?
>> Been Kim: There is some art to making Gibbs sampling work. It is about hyperparameter –. >>: [inaudible]. >> Been Kim: I found it to be okay, and I kind of learned, this is not my first Bayesian generative model, I learned how to make it behave itself. But I think there are a lot of other inference techniques, like variational inference, that could also be applied to this model if I sit down and work out the conjugacy. >>: What about the naive approach where you just run your LDA or whatever on this data and then for every one of the clusters you find a recipe and just show that? >> Been Kim: Yes, that would be another option; in fact my human subject experiment explored something very close to that. It doesn't pick an example, but it shows a list of ingredients that really looks a lot like a recipe. All right, another data set that I ran BCM on is handwritten digit data. This is the USPS handwritten digit data and I am showing you 5 different clusters. On the right I zoomed in on one of the clusters. On the left, the first row shows how it learns the prototype as the Gibbs sampling iterations go on from left to right, and the lower row shows the subspace. So you want to see something that doesn't make sense on the left and hopefully makes more sense as you move to the right. The prototype looks like it's learning digit number 7 and the subspace looks like a 7, which makes sense. So we were encouraged by these 2 experiments and we said, "Okay, what about the performance?" Yes? >>: I thought the prototype has to be one of those N examples, but why do you have [inaudible]? >> Been Kim: This one? >>: Yeah. >> Been Kim: Yeah, I get this question all the time; it is one, actually. So I double checked, it's a number 1 in their examples. It's a digitally recorded handwritten digit data set, so if somebody put the pressure really low then it looks like that; it exists. Okay, what about performance? We compare the clustering performance with LDA because it uses the same admixture model to model the underlying data distribution. We tested it on 2 different handwritten digit data sets and 20 Newsgroups, and the green one on the top is BCM and the yellow one is LDA; we show that BCM is able to maintain performance and often it performs better than LDA. We also performed a sensitivity analysis. There are a couple of hyperparameters that we can tweak, and we just want to make sure that we didn't get lucky by hitting the right hyperparameters. So we tested within a range of values for the different hyperparameters and show that the performance didn't change significantly. >>: So just to clarify, this is classification, right? >> Been Kim: Yeah. >>: So you are representing the object by its topic mixture weights as the new feature representation? >> Been Kim: Yes, exactly. >>: Okay, so here you are saying that somehow this transformation maintains [indiscernible]? >> Been Kim: Yes, I spoke briefly about this earlier, but I went quickly. In the LDA paper the way that they tried to convince the readers that the clustering performance is good is that they use this cluster distribution as a new feature, so like you said, combined with an SVM, to produce this sort of graph in a [indiscernible], and that's exactly what we did. >>: Not very convincing anyway, but I understand what you are doing and it's good you see that it's [indiscernible]. >> Been Kim: Yeah, I think it's difficult to pin down how to evaluate a clustering method right there.
There are other methods like topic coherence or other measurements the topic model community came up with, and it's subjective, it's difficult and you don't have a ground truth. Yeah, it's a difficult problem. >>: So can you explain intuitively why this new model BCM is better than LDA? >> Been Kim: It's coming, I have a pictorial example of that. >>: So in this case you have one prototype per class or per topic? >> Been Kim: Yes. >>: Since you are doing it this way, because you have a [indiscernible], why not have more than 1 per class? >> Been Kim: Yeah, I think it would be a good idea to extend the model to have multiple prototypes. It just makes things more expensive when it comes to inference, but I totally agree; there are [indiscernible] working at Stanford who learn multiple prototypes to cover different ranges of examples. I think that's a great idea, but it's something that I haven't done. Yeah? >>: So what are the features to the SVM? >> Been Kim: So the features to the SVM are this pi vector that describes the cluster distribution. So I showed you the Z vector that collects A's, B's and C's. You normalize them to get the distribution vector, so it's like the parameters of a multinomial distribution. So it's just like LDA. Okay, so a pictorial example to answer your question, a pictorial example of why this might be true. Think about the posterior distribution of LDA and think about a level set where all the solutions on that level set score equally in the LDA model. LDA would pick any of these because they are equally good, but what BCM is trying to do is push the solution towards some other point that is also equally good in terms of clustering, but also interpretable for humans, because in the posterior of BCM this point would score higher than that point. You can also think of the way that BCM characterizes the cluster as working as a really smart regularizer for this model. >>: So do you think human interpretability effectively creates a sort of margin? >> Been Kim: Yeah, that's how I intuitively understand why this might be the case, or another explanation that you briefly mentioned: because we are clustering around the examples, maybe that gives rise to a better solution, because it exists. >>: So you are using the same number of topics? >> Been Kim: As LDA, yeah. >>: I have a simple explanation for what was happening there; your LDA model cannot capture all the correlations in the data, but the data has those correlations and you are using the data in your model, so that's why you are getting this sort of hybrid thing that captures these higher-level correlations that are not common. >> Been Kim: Yes, exactly, that as well. >>: That's why I am saying multiple examples are even better. Then of course you have to compare with [indiscernible] approaches. >> Been Kim: Right, exactly. Yeah, maybe you could even learn how many examples you need to explain a cluster. That would definitely be a good way to expand this work. All right. We talked about this, but I perform Gibbs sampling to do the inference. This is an equation that I don't mean to walk you through, but you can see that we integrate out the phis and pis because we don't need them, though we can forward-generate those cluster distributions if we want to. So the last experiment that I did is: does this learned representation or explanation make sense to humans? One way to measure this is taking subjective measures. You have people come in and you ask them whether they like explanation A versus B.
A better way to do this, in my opinion, is taking an objective measure of human understanding by asking humans to be a human classifier. So we explain 4 to 5 clusters using the 2 different methods, BCM and LDA, then we give them a new data point and ask them to classify which cluster this new data point belongs to. We measure how accurately they can do that. Whether a cluster is explained by BCM or LDA, it looks like a list of ingredients. We don't give them the name of the dish because that would be too easy. When it is explained by BCM it makes a dish; it's an example. When explained by LDA it is the top K topics or ingredients for each of the clusters. So it looks a lot like it would make a dish, but it does not make a dish. So it is not an example. >>: Can you explain that again? I didn't quite get it. >> Been Kim: How the clusters are –? >>: How they are presented. >> Been Kim: So this is an example, the example that I pointed out earlier. This is some Asian chicken dish. These are the ingredients that make that dish; this is an example. >>: So this is a particular example [indiscernible]? >> Been Kim: Yes, whereas for LDA I just select the top K ingredients of each cluster, which when you look at it looks a lot like a dish, but it does not make a dish. >>: I see. >>: It seems that there are 2 things going on there. There is the prototype and there is the fact that you have a real example, right. So I could take LDA and then randomly sample a data point and then look at its [indiscernible] cluster and see if the person agreed with it. >> Been Kim: Ah, randomly sample among the samples that belong to a particular cluster? >>: Yeah, or just randomly sample a –. So the point is the prototype is special because it somehow maximally represents the cluster versus an arbitrary guy in the cluster. So both differences are good. >> Been Kim: Yeah. >>: [indiscernible]. >>: Or like, perhaps salt is in everything so it's not a good predictor, but if you don't have salt then it doesn't meet people's perception of a recipe. >> Been Kim: Yeah, true. There wasn't a question, right? >>: Well, so I am suggesting that maybe the comparison is a little unfair in the sense that you are giving BCM not only a prototype, but a real example that LDA doesn't get. I could have LDA-prime that returns a real example as opposed to the top features. >> Been Kim: Right, so we could compare with the same sort of nearest-neighbor example. >>: And as I mentioned earlier, off the LDA you could just pick prototypes. >> Been Kim: Right, right. >>: [inaudible]. >> Been Kim: Yeah, we could do that. I think the way that I didn't want to add on to –. Yeah, no, that could be a perfectly good baseline that we didn't test. >>: You said something about your test dishes. >> Been Kim: Yeah. >>: Where do your ground truths come from? >> Been Kim: Where is it coming from? >>: Yeah. >> Been Kim: It is one of the data points from the data set that I mentioned earlier. >>: So it is one of those N things? >> Been Kim: It's one of those N things, yeah, because it's a new example. As a classifier you take a new data point and you classify it. >>: What do you mean a new example? So my question is, do you leave that point out when you are training your model? >> Been Kim: Oh I see, yeah, of course, yeah. >>: So you actually have to train a whole bunch of models, leaving a bunch of dishes out? >> Been Kim: Why would I need that? >>: I am just asking you. Your specific dish, your test dish –. >> Been Kim: Would it be left out from –?
>>: It came from your original corpus, right, but when you trained your BCM model was that in the training set? >> Been Kim: No, it shouldn't be, because then it would be –. >>: So what's your protocol? Did you leave a bunch of dishes out? >> Been Kim: Oh I see, yes, of course, yeah. >>: So that's what you did. >> Been Kim: Yeah, yeah, yeah. >>: How much did you leave out? >> Been Kim: I asked 16 questions per person. So I think I left out about 20 dishes when I was doing the clustering. >>: And the true label of the dish, how is that determined? >> Been Kim: These are all really good questions. It is determined based on its name. We had 2 independent human annotators, and we gave them classes that they could label each dish with, and they labeled them. If you look at it as a human it is pretty obvious which cluster a dish would belong to. >>: Sorry to dwell on this, but the clusters are with respect to what the BCM model has covered, right? >> Been Kim: Uh-huh. >>: So have you compared against a pure computer system where you try all N items and just see where those guys fall, where those dishes [indiscernible]? >> Been Kim: So assuming that it would come up with some other clusters than what has been tested? >>: Right, okay, I see the problem. We can talk about this offline. >> Been Kim: Okay, great. >>: So when people do this they are labeling, by your recipe, they are doing an admixture of the new recipe with respect to names, and they are using the names of dish 1 and dish 2. >> Been Kim: Yeah, so we don't give them the actual name of the dish. >>: And how much time is spent analyzing this? There is a complexity here because now I as a human subject have to read the recipe and understand it. I do a lot of processing. So if I spend very little time I might be confused. I might be learning the task as I go, so the early examples I do wrong, but the later ones I do better. Also, if I am allowed to name these things I might do better. >> Been Kim: Interesting. >>: If I am asked to draw what the recipe is going to look like at the end and make the recipe I may do even better. So I don't know how that's –. >> Been Kim: I see. >>: It's a [indiscernible] factor, but also it is kind of an interesting research question. >> Been Kim: Right, right. >>: So what are your experiences regarding that? I know that was not the target, but. >> Been Kim: Yeah, no, it is a part of the experiment where we account for the ordering, whether you do better on the later questions than the first ones. We have a perfect Latin square: which question is presented earlier differs from one participant to another so we can balance that effect. In terms of naming things, in the first pilot study we actually struggled to get people to the same granularity of recipes. What I mean by that is when I showed them the cluster they were thinking, "Is it this holiday brownie or is this mint brownie?" That's the kind of granularity they were thinking at. So we were like, "No, it's not." So we gave them candidate categories, including some that represented the true clusters and some other things at the same level of the hierarchy, to prime them to think at this sort of level. >>: [inaudible]. >> Been Kim: So when we did that, and you probably already know this by now, we had 24 people, 384 classification questions, and we got statistically significant results.
What this is showing you is that this model learns reasonable prototypes and subspaces, maintains performance, and can improve human understanding for the task that we tested. So, moving on to the next part, closing the loop: at the beginning of the talk I talked about how we want to leverage human domain expert knowledge back into the machine learning system. In this part of the work I extended BCM in order to incorporate human domain expert knowledge and implemented this for a real-world application, the computer science education domain. So before going into detail, and I feel like this crowd will buy into this idea, let me convince you why interacting is important. This is the data set that I showed you at the beginning of the talk. I showed you that one way to cluster this data, and we worked this out together, is Mexican food, crepe and chocolate berry tart. But if you are the owner of a restaurant and you are trying to hire a pastry chef, the way of clustering your menu items that is really useful for you might be this: savory food on the left and sweet food on the right, because when you are evaluating your candidate you want to make sure they can cook these things. What this is trying to tell you is that depending on what you are trying to do with the clustering results, what is most useful could be different. I am arguing that trying to figure that out interactively is a good idea. And of course leveraging interactivity to deliver better clustering or classification results has been studied; a lot of people at MSR have studied this idea. Some of this work assumed that the users know machine learning, so we help them better explore more parameter settings, assuming that they know how different hyperparameter settings would influence the performance. Some other work assumed a simplified medium for communication: users would internalize this medium as something and communicate through it, and the machine learning system would take this medium and transform it into something that makes sense to machines, but these two could potentially be different. The work that I'm suggesting here provides a unified framework where humans and machines can talk and communicate using the exact same medium: prototypes and subspaces. So we made BCM interactive, and we called it iBCM, by turning the prototype and subspace nodes into what we call interactive latent variables. These are like latent variables, but not quite, because the values of these nodes are inferred not only from the data but also from feedback from humans. The key reason why building an interactive system is difficult is that you want to balance what the data is trying to say and what humans are trying to say. You don't want to completely override what the data is trying to say with what the human said, but at the same time you don't want to completely neglect what the human is trying to say. So our approach is decomposing the Gibbs sampling steps to adjust the feedback propagation depending on its confidence. And also, since we are working with an interactive system, we want to make sure that the response time is really quick. You can't have them wait forever. So we accelerate the inference by rearranging the latent variables. At a high level, without going too much into detail, we first listen to users, then we propagate the user's feedback to accelerate inference, and then we reflect the patterns of the data; a minimal sketch of that loop appears below. We tested this in 2 different domains; the first one is an abstract domain and then I am going to show you the real-world domain.
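Below is a minimal sketch of that listen/propagate/reflect idea for a single binary subspace vector. The per-feature inclusion probability stands in for the real Gibbs conditional, and the pinning mechanism is a deliberate simplification of how iBCM balances user feedback against the data; it is an illustration under those assumptions, not the actual iBCM sampler.

```python
# Sketch of an "interactive latent variable": a binary subspace vector whose
# user-pinned entries are held fixed while the remaining entries are still
# resampled from the data. A simplified illustration of the listen ->
# propagate feedback -> reflect-the-data loop, not the real iBCM inference.
import numpy as np

rng = np.random.default_rng(0)

def resample_subspace(omega, data_prob, pinned):
    """One sweep over the subspace indicator vector omega.

    omega:     current 0/1 vector (1 = feature is in the subspace)
    data_prob: per-feature inclusion probability implied by the data
               (a stand-in for the Gibbs conditional)
    pinned:    dict {feature_index: value} of user feedback to respect
    """
    omega = omega.copy()
    for j in range(len(omega)):
        if j in pinned:
            omega[j] = pinned[j]                     # listen to the user
        else:
            omega[j] = rng.random() < data_prob[j]   # reflect the data
    return omega

omega = np.zeros(6, dtype=int)
data_prob = np.array([0.9, 0.1, 0.8, 0.2, 0.5, 0.05])
pinned = {1: 1, 0: 0}   # the user forces feature 1 in and feature 0 out
for _ in range(20):
    omega = resample_subspace(omega, data_prob, pinned)
print(omega)  # pinned entries always match the feedback; others follow the data
```

In the real model the balance is softer than hard pinning, since, as described above, the goal is to neither override nor ignore what the data says.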
The abstract domain is useful because we have control over what the distribution of the data will be. In this case we generated the data such that there are multiple optimal ways to cluster it. This is the interface that humans used to interact with our system. Each row is a cluster, so I am showing you 4 clusters. The first column is the prototype, so the second row says that data point 74 is the prototype of this cluster, and on the right I show you the features that it has, along with the subspaces. So if it has checkmarks it has a round pattern and it's a triangle and it's yellow, but only these two features are in the subspace. So it's a triangle and it's yellow, and if you look at the right, which shows the other items in that group, they all have triangles and are colored yellow. There are 2 things that users can do: you can click one of the checkmarks to change it into a star checkmark, which includes that feature in the subspace, and you can click a star checkmark to turn it back into a checkmark, which excludes that feature from the subspace. You can also click any of these items to promote that item to replace the prototype of any of these clusters. >>: So the first cluster has the round pattern feature starred? >> Been Kim: Say that again. >>: Your first row has a starred feature called "round pattern", which I assume means not lines. >> Been Kim: Oh yeah, yeah. >>: [inaudible] has overridden the user's feedback and said, "Oh well, I am going to throw lines in there because that's what the data says." >> Been Kim: So maintaining that balance is difficult. That's part of what this is trying to do. The way that we ran this experiment is that we first showed the subjects the whole data set randomly arranged, and then we asked them, "What is the way that you would prefer to cluster this data set?", in order to learn their inherent or underlying preference. Then we showed the results from BCM. Essentially what this does is select one of the optimal ways to cluster the data. Then we asked subjects, "How well does this match your preferred way to cluster?", and we collected that data on a Likert scale. Then we had subjects interact with iBCM, so they could select or un-select the subspace features and promote prototypes to make the clusters more like what they wanted. Then we asked the same question again, to indicate how well the results matched what they want. We compared the answers collected in these two steps; we asked 24 participants 192 questions, and subjects agreed much more strongly that the final clusters matched their preference after their interaction, and that was statistically significant. So, encouraged by this, we took iBCM to a real-world domain, to education. >>: Did you test for the placebo effect here? I mean, maybe they just like the final clusters because they contributed. >> Been Kim: I see, we collected –. >>: [indiscernible] a random cluster, I mean just add noise to your cluster and say, "This is a response to what you have said." Do they still prefer the new ones to the old ones? >> Been Kim: Because they just feel like they have interacted with it. >>: Yeah, they are committed to it, invested in it. >> Been Kim: Interesting idea. So this might be slightly [indiscernible], but to answer your question, we didn't test that, the placebo effect. But one of the ideas of the interactive model is that often when users start to interact with interactive systems they don't actually know what they want.
As you interact with it you kind of learn, "Okay, yeah, this is what I wanted," and it's a way for them to figure out what they want as well. If the final clustering were really not what they wanted and it were just that simple placebo effect, I personally don't think that would be the case, but it's something that we could totally test. >>: Can you go back to the previous slide? I think I have a related question here. What do you mean by the first step? >> Been Kim: So the first step is really trying to fetch what humans wanted initially. What are their underlying preferences? >>: But to find that, the subject is supposed to arrange those objects into clusters? >> Been Kim: I see, no, we asked them, "What are the features you want each cluster to have?" Because there are only 6 features here, they would say, "I want cluster 1 to have the line pattern feature and this color." >>: I see, I see, but what about what I just suggested? Maybe half the humans just play freely and form clusters. Then you have an objective way to measure. >> Been Kim: Yeah, yeah, yeah, the way that I measured this p-value is just simply comparing these two, but you are right. If I had compared it with the original then the placebo effect could be tested. I looked at it and scanned it, and I don't have a number for it, but they are roughly similar; I just don't have a number for you right now. >>: So the features are predefined; you gave them the features? >> Been Kim: The features of the abstract data? >>: Yeah, the circle, round [indiscernible]. >> Been Kim: Yeah, yeah, because those are generated. >>: Have you thought about just letting them come up with features, giving them an empty table and saying, "You fill in whatever you want," and then other people use the features that somebody else came up with, and so on? >> Been Kim: That's interesting, that would be interesting, but I don't know how you would evaluate that. If you are using different features –. I guess if the only thing I am going to take is a Likert scale then I can probably evaluate it, but that would kind of introduce another factor that defeats the purpose of using the abstract domain, because the point of the abstract domain was that we have this clearly defined world where we allow only this many features. >>: [indiscernible]. >> Been Kim: No, a completely different set. There is no overlap between those two, because the first people knew what I was doing. So, encouraged by the previous results, I took iBCM to the real domain, and education is a particularly appropriate domain for iBCM and interactive machine learning because teachers have accumulated years of knowledge, or maybe a philosophy, of how an introductory Python programming class should be taught, and we want to leverage that and deliver something that is useful for them. In particular, the domain that we were looking at was how they create a grading rubric by exploring the spectrum of students' submissions for homework problems. Currently what teachers do is randomly select 4 to 5 submissions, look at them and scan them in order to create the rubric, but we know that if teachers can better understand the variation in this data they can provide better tailored feedback for the students and ultimately, hopefully, improve the education experience for the students. But it is difficult working with code data because we don't have obvious features as we do in other types of domains.
So we leverage a system that Elena Glassman and Rob Miller at MIT developed called OverCode, which performs static and dynamic analysis on the code to extract the right features. So we use those. This is the system we built. On the left we have prototypes. I am showing you 3 clusters. The last one is blue because it is clicked. When it is clicked, the right side shows you the other items, other homework submissions, that belong to that cluster. And these are from a couple of years of homework submissions collected from the MIT introductory Python class. The red rectangles are the subspaces, of course, the keywords. Humans can interact with the system in 2 ways: similarly to before, select or un-select subspaces, so make a keyword go inside the red rectangle or not, and also promote certain examples to be prototypes. So if a teacher likes a submission they can promote it to replace any of the prototypes. The way we performed this experiment is by comparing 2 interactive systems. We made the benchmark system interactive to keep the engagement level fair for both systems. The first one is the iBCM system, the one I just told you about, and the second one is still interactive, but the subspaces and prototypes are preselected for the users. The way that this is generated is that we ran BCM with new random initializations, so it converges to different clusters, but they are still optimal according to the internal metric of the clustering method. So when users click that button up there it just shuffles around and shows you a new clustering. We invited 12 teachers who previously taught the Intro to Python class at MIT and we told them that their job was to explore the full spectrum of students' submissions and write down a discovery list: any features or interesting things they find while looking through this large set of student submissions, write them down. This is a video demonstration of how a teacher might use the iBCM system. First the teacher goes to the first cluster, and because "while" is in the subspace most of the submissions on the right have the "while" keyword in them. The teacher scrolls down and finds that everyone used "while", until they find something interesting: some person imported an [indiscernible]. If you are familiar with Python, this is like importing a library. The teacher says, okay, let's take a further look at this, by promoting this one to a prototype, which then indicates what's interesting to the teacher. Then they look at all the other people who imported a module. Somebody imported math, pi, [indiscernible] tools, and then the teacher moves on to the next cluster and checks through it until they find something interesting. This student is checking the lengths of these two vectors, and depending on what teacher you are talking to, some people think that's a good idea and some people don't. So the teacher goes through, promotes that to the prototype, and continues to investigate the student submissions. >>: So do I understand correctly that you have a fixed number of clusters, hence prototypes, and if I find something potentially interesting I need to replace one of my existing [indiscernible]? >> Been Kim: Yes, yes, this is a good question, because the task, the way we envisioned this tool being used, was more about exploration. It's not about clustering the data set.
Once you have explored and discovered things from the existing clusters, our idea was that you can now forget about that cluster, replace it, make it into a different cluster, and continue to explore. >>: Yeah, I understand, this is essentially what we call "island finding", finding new things. But I'm a little bit worried that, because of your iBCM system, there would be some coupling, in that having to push down some existing clusters might hide something. >> Been Kim: Yeah, totally, totally; I feel like we worked on this together or something. Yeah, we had that effect. We started this, and the way that we overcame that issue, and I think it's still a really important issue to overcome when you are working on interactive clustering methods, is that you need to explain how the system works. And explaining that to someone who doesn't know machine learning is still a very challenging problem. They need to know that these two clusters interact. They share some information, so if you make one cluster suck up all the submissions of one particular type, you are not going to get anything like that in the other clusters. >>: Another potential solution is to go Bayesian non-parametric. Instead of a fixed number of topics you can go the route of things like [indiscernible]. >> Been Kim: Right, but that doesn't solve the problem of interacting clusters. That only means that if there exists enough evidence it is going to create another cluster. >>: I can create a new cluster. I can construct a new one. >> Been Kim: Right, right, totally, yeah, yeah, but explaining the nature of the clustering method is very difficult. So the way we did it is we trained these people on how the system behaves, the behavior they should expect, before they used it. Yeah, good points. So we invited 12 subjects who previously taught the class, we showed them 48 problems and we asked them 15 Likert-scale questions, where 12 of them resulted in statistical significance. They fall into one of these categories over here. With iBCM they said they were more satisfied. They better explored the full spectrum of student submissions. They better identified important features to expand the discovery list. They thought that the important features and prototypes were useful. Some other quotes from the participants: they said that iBCM enabled them to go in depth on what students did. They found it useful particularly for large data sets where brute force would not be practical. And this is particularly encouraging because with the rise of MOOCs, where teachers now have to teach not just hundreds but thousands or maybe more students, maybe this interactive system can help them better explore student submissions and ultimately deliver better education experiences for the students. >>: But here you again have that same effect as before, because if you just give them some clusters and they spend time using them, as opposed to giving them an opportunity to play with the data, understand where the clusters are going, and come up with almost the same solution themselves, they might prefer that second path to the first path, which doesn't mean the clustering itself is better; it means that exposing the technology and making them understand the technology is what actually helps. >> Been Kim: I see, that's the experiment that I did in the interactive experiment. That's exactly what I did.
This is the pre-clustered one and this is the one they can interact with. I asked which one they preferred and they preferred the first one. >>: Yeah, but then the question is, "Do they prefer this because this whole system helped them learn, or did they prefer it because they ended up with better clusters?" I mean, both are valuable. >> Been Kim: What do you mean by better clusters here? >>: So in the end, when they create these new prototypes by moving things around, they get a different set of prototypes than you would have gotten with BCM applied directly. >> Been Kim: True, but BCM –. >>: Or LDA and then choosing the best matching prototype. >> Been Kim: Right, but BCM is also optimal. All of these clusterings are good; it's just a matter of which one matches your idea of the clusters better. >>: But even then, the issue is that when you just provide the clusters I have no idea what the clusters are. When I do this I get an idea of what the clusters are. That doesn't mean that –. >> Been Kim: I see, just learning about the [inaudible]. >>: I mean, we could have started with the clusters that you end up with. You run this experiment and end up with good prototypes for one teacher, but you give these prototypes to the next teacher and they go back to something else. If you run this in a circle like this they might just be going all over the place. They are all happier with the resulting clusters, not because these clusters are better for them, although that's the illusion you get, but simply because by going through all of this, to [indiscernible], they start understanding the technology. >> Been Kim: I see, got it, got it. So this second tool kind of gives you that feel by giving you different ways to cluster the data. The users can see 5 or 10 different ways to cluster the same data. That gives you an opportunity to learn. >>: But I think that's where the value is, rather than in getting the optimal clusters; maybe they are not getting the optimal clusters. They are just getting a set of clusters, but now they understand what they mean. >>: So, related to this, what is your quantitative measure for this result? >> Been Kim: It's the 15 Likert-scale questions, post-questionnaire questions. >>: But this is done per subject, right? I am going to evaluate my own. So maybe one solution is to do a cross-subject evaluation. I evaluate your resulting [indiscernible]. >> Been Kim: So we thought about this idea, but then there is another conflict, in that what we were trying to do here is leverage the teacher's philosophy of teaching. We have like 10 to 12 teachers coming into the room and they have really completely different ideas of what a good homework submission should be. Like the "assert" example: some people think, "Oh yeah, of course you have got to check the lengths of these two vectors," but some teachers are like, "No, you don't do that, it's a waste of a line. You just assume that the given inputs are all checked." So these things are very different, and if I ask you to check my results, chances are it's probably not ideal for you. >>: [inaudible]. >>: I just think, like, a teacher comes into your system and starts moving things around. So like the cat comes onto the mat and she starts moving the mat around; the mat is going to be exactly the same when she lies on the mat, but she thinks, "Well, now it is better than it was before because I have done something to it." >> Been Kim: My assumption is –.
>>: What you do to it is not to "it"; you are doing something to yourself. You are preparing yourself to lie on the mat. >> Been Kim: Right, so my assumption in my work is that humans are better than that. Humans have some expert knowledge that we can really leverage in the system. That is an assumption of my work. >>: But it's testable, and either way it's useful; even if that's what happens, that the teacher gets used to this tool, that's still useful. >> Been Kim: Right, how would you test that? >>: I think that's the answer, actually. >>: You could test for coverage or recall of different features. So just by using the tool, collectively how many different features can be discovered. >> Been Kim: So these are things that we thought about: collecting the discovery list, what they wrote, and how much coverage that gives compared to, say, the oracle set of features for the whole assignment. That's all subjective; we have to somehow come up with the oracle list, which is subjective to begin with, and be able to count what the teacher writes as features. The teacher might mean two different things, and having that laid out one by one in a Boolean vector is really hard. So even calculating this absolute value, measuring this, is actually really hard. Okay, so I talked about machine learning models that can make sense to humans by building the Bayesian Case Model, which provides intuitive explanations using examples. Then I extended this idea to be interactive and implemented it in a real-world domain, a computer science introductory Python class, to show how this could be useful. I think there are a lot of really exciting things that can be done in interpretable and interactive machine learning. One other thing that I worked on briefly, and it deserves a lot more attention, is visualization. I worked on this idea briefly when I was an intern at Google, where I worked with software engineers whose job is to look at very complex data on a daily basis and explore its patterns. I looked at different feature reduction methods and compared them to see what the ways are to represent the same data such that they could better explore the distributions of the data. Another thing that I think is very important, and it goes back to our discussion earlier, is looking at what your need really is in a specific domain. One of the projects that I worked on was with a medical domain expert who looks at autism spectrum disorder data, and for them a meaningful clustering is not just clustering the data, but really figuring out which features distinguish 2 different clusters. What is the difference between A and B? What's the difference between B and C, to help hypothesis generation, for example? So we built a model to learn those distinguishing features; a toy sketch of that kind of analysis is shown below. Another thing that I think people like [indiscernible] at MSR have looked at is using this interactivity as a way to help data scientists debug models and better explore hyperparameters. Now a lot of companies are hiring data scientists whose daily job is to look at a model, stare at it, and ask which hyperparameters are a better fit for my performance, how do I achieve the performance that I want? I think we can use interactivity to really make their job easier and more efficient. So, I work at AI2.
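Going back to the distinguishing-features idea mentioned above, here is a toy sketch of one simple way to surface features that separate two clusters: fit a sparse (L1) logistic regression between their members and read off the largest coefficients. The data, feature names, and the choice of scikit-learn are illustrative assumptions, not the model described in the talk.

```python
# Sketch: surface features that distinguish two clusters by fitting a sparse
# (L1) logistic regression between their members and reporting the largest
# coefficients. Illustrative only; not the model from the talk.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
feature_names = [f"feature_{i}" for i in range(8)]

# Made-up binary data: cluster A tends to have features 2 and 5,
# cluster B tends to have feature 0.
A = rng.random((40, 8)) < np.array([.2, .5, .9, .5, .5, .8, .5, .5])
B = rng.random((40, 8)) < np.array([.9, .5, .1, .5, .5, .2, .5, .5])

X = np.vstack([A, B]).astype(float)
y = np.array([0] * len(A) + [1] * len(B))

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
order = np.argsort(-np.abs(clf.coef_[0]))
for j in order[:3]:
    print(feature_names[j], round(clf.coef_[0][j], 2))  # top distinguishing features
```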
So I work at Ai2. At Ai2 we think one of the problems that will help us take a step forward in AI is building a machine that can answer a fourth-grade science exam. Here is an example of a fourth-grade science question, "In which environment would a white rabbit be best protected from predators?" The options are: shady forest, snowy field, grassy lawn and muddy riverbank. When you look at this question as an adult, hopefully you see that the teacher is trying to test the idea of camouflage. If you know that abstract concept called camouflage you might be able to apply it to answer this question. So what I am trying to do at Ai2 is: how do we learn these abstract concepts in an unsupervised manner from a large corpus? Once we detect this abstract concept then maybe we can answer these questions using the sort of abstract idea that humans have defined. And one of the key insights that I think will help us is that the learned graph that represents this abstract concept should make sense to humans, in order for it to be a concept that humans have decided to define, such as camouflage. One of the first ideas that we are exploring is using a word embedding representation; this is just learning a mapping from a word to a point in a vector space, and using that sort of representation to learn the contexts in which it represents camouflage. That relaxes the assumption I had in my earlier work, where the data set given to me already had interpretable features. It is no longer recipe or patient data; it is features that we don't know what they are about. >>: [inaudible]. >> Been Kim: Oh, this matrix is actually a [indiscernible] representation of this particular question. I just put it there because –. >>: So the entire question is one point? >> Been Kim: Just one question. >>: So this entire question, and then there are answers. So it is just the question, without the answers, that is being embedded? >> Been Kim: I see. So what I did here to make this matrix is I selected a bag of words from the question, this is very preliminary, we are just starting to investigate, and then I extended them with their neighbors in [indiscernible] space to kind of have a smoothing effect. So we are not only looking at rabbit; I am also going to look at fox and raccoon. Then we cluster them with respect to rows and columns to learn that these are the words that commonly co-occur and these are the contexts that commonly co-occur, and then I am hoping to learn a graph out of it. >>: So is there sorting on the –? >> Been Kim: Yes. >>: They are sorted by? >> Been Kim: They are clustered. These are one cluster in a row. >>: Oh, I see. >> Been Kim: I like how the attention is going directly to the matrix.
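[A rough sketch of the pipeline just described, assuming nothing about the actual Ai2 implementation: expand the question's bag of words with embedding neighbors for smoothing, then co-cluster a word-by-context matrix with respect to rows and columns. The toy vectors, toy counts, and the use of scikit-learn's SpectralCoclustering are all stand-ins.]

```python
# Toy reconstruction: expand question words with embedding neighbors,
# then co-cluster a word-by-context co-occurrence matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors; a real system would load pretrained embeddings instead.
vectors = {
    "rabbit":  np.array([0.9, 0.1, 0.0]),
    "fox":     np.array([0.8, 0.2, 0.1]),
    "raccoon": np.array([0.7, 0.3, 0.1]),
    "snowy":   np.array([0.1, 0.9, 0.0]),
    "white":   np.array([0.2, 0.8, 0.1]),
    "forest":  np.array([0.1, 0.2, 0.9]),
}

def expand_with_neighbors(words, vectors, k=2):
    """Add each word's k nearest embedding neighbors, for a smoothing effect."""
    expanded = set(words)
    for w in words:
        if w not in vectors:
            continue
        sims = sorted((cosine(vectors[w], vectors[v]), v) for v in vectors if v != w)
        expanded.update(v for _, v in sims[-k:])
    return sorted(expanded)

question_words = ["rabbit", "snowy", "forest"]
rows = expand_with_neighbors(question_words, vectors)

# Toy word-by-context co-occurrence counts (contexts as columns).
contexts = ["predator", "camouflage", "habitat"]
counts = np.random.RandomState(0).poisson(2.0, size=(len(rows), len(contexts))) + 1

# Cluster rows (words) and columns (contexts) jointly.
model = SpectralCoclustering(n_clusters=2, random_state=0).fit(counts)
for c in range(2):
    words_in_c = [rows[i] for i in np.where(model.row_labels_ == c)[0]]
    ctx_in_c = [contexts[j] for j in np.where(model.column_labels_ == c)[0]]
    print(f"cluster {c}: words={words_in_c}, contexts={ctx_in_c}")
```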
What I think would be a really interesting topic to work on is this: deep learning has achieved much success in various fields, and in my opinion one of the drawbacks of deep learning is that it is not interpretable. What we really want is for people to gain more insight into this powerful tool and ultimately make better decisions. Looking at how we can assign or learn interpretability from such non-interpretable data is another interesting research topic. >>: But we could make the argument that any model is interpretable, in the sense that I can always remove a feature, retrain, and compare to the original model, and now I have an explanation of the power of that feature. >> Been Kim: How would you compare the 2 models? >>: Oh, compare 2 models? >> Been Kim: You said you remove a feature and then you compare the 2 models. How do you compare? >>: So I want to know the effect of that feature on this model. So I train it with and without the feature. If the performance goes down a lot then I know this feature is important. So now I can sort all the features. So even though I don't know what's happening inside, I know the effect of all the features, which is a way to interpret the model. >> Been Kim: I see, yeah, this is another idea that I actually thought about when you are working with, like, deep learning or neural nets: removing [indiscernible] and seeing what that does in order to assign a meaning to it. I think that's a good way to go about it, but in my opinion it's not direct enough. If you have thousands of neurons in your layer you are going to have to train thousands of models, and training one model takes 4 weeks. >>: [inaudible]. >> Been Kim: If you have an idea about this –. >>: [inaudible]. We tried that with our splicing work. We can predict how certain genes will be spliced, but now for the biologists we want to interpret it and say, "Well, this is what does that, and this is what does that, and so on." It turns out that the features are so correlated that we couldn't get anything interpretable this way. You almost have to –. You should take derivatives, but with something more complex, and then you have to kind of interpret that too. So it is not as easy. >>: But basically, when we say things are not interpretable, I kind of object to this, because there is a way to interpret what's happening. There are things you can do. You can move things from train to test and that tells you a lot. You can train with or without the feature and that also tells you a lot. You can always remove a feature and then retrain without the first feature –. >>: Yeah, if you want to spend a lot of time everything is interpretable, if you want to spend a ton of time. So maybe –. >>: No, no –. >>: [inaudible]. >>: [inaudible]. >>: Yeah, but once you hold out a feature, what you get is that you now have to hold out another feature. You then have to go back, after you group the features, and say, "Now how about this group of features versus that group." It's an interesting process and it becomes, well, we failed. In the end we just couldn't do the interpretation of the new model. >>: [inaudible]. One thing you could do though, which is close to what you are already doing, is you can still do a case-based kind of analysis of what [indiscernible]. We go somewhere into the internal representation, do a Euclidean distance on the internal representation, and then you can still find cases that are similar to each other, similar to a test example. And if you believe that the cases are intelligible to humans, then you can explain the reasoning of the model by presenting these cases to humans. >> Been Kim: Yeah, yeah, that's true. I think this comparing-features idea has trouble for the exact reasons that you mentioned. For those listening in who can't hear you guys: that is exactly right. It quickly explodes when you try to enumerate all these questions, and because of the correlation you can't really tease them out. The exact meaning of that is [indiscernible]. >>: But it's not a bad place to start. >> Been Kim: It's not a bad place to start. >>: It's actually a complex research issue. It's not easy. >>: [inaudible]. >>: Yes, that's true. >>: [inaudible].
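[A bare-bones sketch of the retrain-without-the-feature idea from the exchange above, on synthetic data with a generic classifier; as noted in the discussion, heavily correlated features can make the resulting ranking misleading, and retraining once per feature can be prohibitively slow for large models.]

```python
# Ablation importance: retrain without each feature and measure how much
# held-out accuracy drops. Model and data here are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_and_score(X_tr, y_tr, X_te, y_te):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

baseline = fit_and_score(X_tr, y_tr, X_te, y_te)

# Drop one feature at a time, retrain, and record the performance drop.
drops = []
for j in range(X.shape[1]):
    keep = [i for i in range(X.shape[1]) if i != j]
    score = fit_and_score(X_tr[:, keep], y_tr, X_te[:, keep], y_te)
    drops.append((baseline - score, j))

# Sort features by how much their removal hurt performance.
for drop, j in sorted(drops, reverse=True):
    print(f"feature {j}: accuracy drop {drop:+.3f}")
```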
>>: That's why I was thinking of that example you had with circles: if you actually had an empty table that people added features to [indiscernible], do they actually end up in [indiscernible] or not? >> Been Kim: That's interesting, yeah. For the coding example I think they definitely come up with different features. >>: [inaudible]. >> Been Kim: For the coding experiment that I did, when I asked them to write down the discovery list, which is kind of identifying the features from the human side, it was all over the place, very different. >>: [inaudible]. If they are all working on that shape problem, and then as somebody adds a feature it populates everybody's list and they see, "Okay, I will just click here instead of adding my own." >> Been Kim: [inaudible]. >>: Then you start automatically erasing some rows which nobody is using, and so what happens? >> Been Kim: Yeah, that's interesting. That's more like a collaborative way of doing it. That would be a next step, or maybe done together. Okay, so the last thing I was going to say is that Ai2 is just down the road in a new district. If you are interested in and care about machine learning that would benefit humans, I would love to chat. Thank you. [Applause] >> Rich Caruana: So feel free to leave, but if you have got another question or 2 we have got another 5 or 10 minutes. >>: I have got a question about something you sort of skipped in iBCM. You described the way to inject human interaction as a procedure in your Gibbs sampling. So what is the mathematical view of that? Changing the Gibbs sampling algorithm is a procedural thing [inaudible]. Is there an equivalent mathematical view to say that humans are essentially [inaudible]? What is that? >> Been Kim: It's kind of like providing a new data point, but a more important data point than an original data point, for example. I don't know if that's mathematical, but that's the intuition I am going for. If you have a data point and the data point is yelling out some pattern, and if you are the human and you don't see the pattern that you are looking to see, then it is kind of like creating another data point. And actually this view has been studied; I think there is a paper from MSR on viewing humans' interactions as a set of new data points, or like a curriculum of, like, principal examples. That's sort of the view that I'm applying here. >>: Okay. >> Rich Caruana: All right, thank you again Been. >> Been Kim: Thank you guys. [Applause]