>> K. Shriraghav: It's my pleasure to introduce Aditya Parameswaran, who has been a former intern with us from Stanford University, to come here for a talk. Aditya has a lot of publications in [inaudible]. His [inaudible] count is more than 25 I think actually. Some pretty large number. And -- Sorry? >>: A crowd of publications. >> K. Shriraghav: A crowd of publications. >> Aditya Parameswaran: You know my secret. >> K. Shriraghav: In fact two of his papers have been among the best papers of their respective conferences, so he has a very distinguished record for a graduating PhD student. And he's here to tell us about human-powered data management. So on to you. >> Aditya Parameswaran: All right. Thank you [inaudible] and thank you for inviting me. It's a pleasure to be here, a pleasure to be back actually. All right. So I'm Aditya Parameswaran from Stanford University. I'm going to be talking about human-powered data management. All right? So we are now in the midst of the big data age. Every minute we have 48 hours of video uploaded to YouTube, 100,000 tweets and so on. Understanding this data would help us power a whole range of data-driven applications. Unfortunately an estimated 80 percent of this data is unstructured, so it is images, videos and text. Fully automated processing of unstructured data is not yet a solved problem. Humans, on the other hand, are very good at understanding unstructured data. We are very good at understanding abstract concepts. We are very good at understanding images, videos and text. So incorporating humans doing small tasks into computation can significantly improve the gathering, processing and understanding of data. So the question is: how do we combine human and traditional computation for data-driven applications? So let me illustrate the challenges using a simple example, the task that I actually wanted to do for this presentation. So I want five clipart images of a student studying, of college age. There must not be a watermark, and it must be suitable for a presentation. Okay? So it's a simple enough task. So the first option is I do it myself, right? But this could take very long. I need to figure out which queries to issue to Google -- or Bing Images. [laughter] Faux pas right at the start. So I need to figure out which queries to issue to Bing Images, and for each of those queries I need to go through hundreds and hundreds of results, so it's really painful. The second option is I ask a friend. Once again it could take very long. The person might not do a good job, right? The third option is to orchestrate many humans. By orchestrating many humans I could get results faster because I'm [inaudible], and because I'm using many humans I may have low error. All right? Of course orchestrating many humans gives rise to many different challenges. First, how should this task be broken up? Presumably I need to gather images from an image search engine. Which queries should I issue? How many images should I gather? How should I check if these images obey the properties? To guarantee correctness I may want to check one property at a time. In what order should I check these properties? Since humans make mistakes, I may want to ask multiple humans. How many humans should I ask? How do I rank these images? How do I optimize a workflow? How do I guarantee correctness? So these are the kinds of challenges that one needs to grapple with when orchestrating humans for human-powered data management. All right? 
So these challenges boil down to a fundamental three-way tradeoff that holds in this scenario: a tradeoff between latency -- How long can I wait? -- cost -- How much am I willing to pay? -- and quality -- What is my desired quality? So recall that in traditional database query optimization the focus is on latency, in traditional parallel computation the focus is on latency and cost, and in traditional uncertain databases the focus is on latency and quality. In this case we have a three-way tradeoff. So if there's one thing I'd like you to take away from this talk, it's our focus on this three-way tradeoff that permeates all of my research on human-powered data management. So to get access to humans we need to crowdsource. All right, so here is a diagram of the landscape of crowdsourcing in the industry by an organization called crowdsourcing.org. As you can see it's a very active area. Each of these tiny icons refers to a separate company. Crowdsourcing means a lot of different things to different people. It could mean generating funding using the crowd, generating designs using the crowd, solving really hard problems using the crowd and so on. In my work I focused on cloud labor or paid crowdsourcing. To get access to cloud labor or paid crowdsourcing, one uses marketplaces. So all of the icons in this figure refer to a marketplace. Marketplaces allow users to post tasks via a low-level API [inaudible] people who are online can pick up these tasks and solve them. And the canonical example of a marketplace, which I'm sure a lot of you have heard of, is Amazon's Mechanical Turk. These marketplaces are growing rapidly. The size quadrupled between 2010 and 2011, and the total revenue reached 400 million dollars in 2011. So here's an example of a task that I could post to one of these marketplaces, asking people, "Is this an image of a student studying?" People who are online can pick up this task and solve it and will get the five-cent reward, in this case. All right? Okay, so now let me draw a diagram of the landscape of research in cloud labor or paid crowdsourcing and tell you how my work fits in. So there are lots of humans; there are lots of marketplaces like I described earlier to reach humans. And once again the canonical example of a marketplace is Mechanical Turk. Then, there has been work on platforms making it easier to post tasks to these marketplaces, dealing with issues like how one should interact with humans, what kind of human issues arise, what kind of interfaces one should use, and so on. Then, there is work on algorithms that leverage these platforms, having humans do data processing operations, operations like comparisons, filtering, ranking, rating and so on. Then there are systems that call these algorithms, asking these algorithms to sort, cluster and clean data. Of course these systems can also directly leverage the platforms by having humans get or verify data. My focus has been on designing algorithms and systems. So the focus of my thesis has been on designing efficient algorithms and systems for human-powered data management. There are four aspects that I've studied: data processing, data gathering, data extraction and data quality. I've also worked on other research that doesn't fit under the umbrella of human-powered data management. If there is time, I'll tell you about that too at the end of the talk. All right. So here's the outline of the rest of the talk. 
I'm going to tell you about two human-powered systems or applications that both motivate and are influenced by my research. I'll tell you about one of them immediately and the other one interspersed with the second topic, which is filtering. Filtering is a critical data processing algorithm that applies to both of the systems that I will talk about. I'll tell you about the other research I've done in crowdsourcing and in other topics and then conclude with future research or open problems. All right, so the first application or system that I'm going to be talking about is the DataSift Toolkit. So the DataSift Toolkit is a toolkit for efficiently executing a specific kind of query, the gather-filter-rank query, on any corpus. So the idea is you gather items from the corpus. You filter them. You rank them and then you produce the result. And humans may be involved in all three steps: the gather step, the filter step as well as the rank step. And what we've built is a general purpose toolkit that can be efficiently deployed on any corpus. All right, so let me dive into a quick demo. So a user of DataSift will see a screen like this. They'll select what they're looking for, in this case Google Images. Sorry, I don't have one for Bing Images. But they'll select the corpus that they're looking for, type in what they're interested in, the conditions the items must satisfy -- these are the filtering predicates -- and how the items must be ranked -- the ranking predicate. They can also specify how many results they want and how much they are willing to spend. All right? Now I'm going to play a video of how I would use DataSift to ask my query. So currently DataSift is implemented over four corpora: Google Images, YouTube, Amazon Products and Charter Stock, but it could be over any corpus. For my clipart of student studying example, I would type in, "Give me a clipart of student studying. Must be one student of college age," and so on and so forth. Right? Unlike a traditional search engine, notice that I can use as many words as I want to describe each of these predicates. Let's say I want ten results, and I say that my budget is, let's say, five dollars. Okay, so this is how a user would post a query to DataSift. DataSift will translate this specification into questions that are asked of the crowd. So there are three steps: the gather, filter and rank steps. So it'll actually gather items, in this case images, by issuing keyword search queries to the corpus, in this case Google Images. So in this case it'll gather items by issuing the keyword search query "clipart of student studying," so the crowd is not involved in this gather step. But the crowd could also be involved in the gather step, and I'll show you an example of that next. Then, DataSift checks if the images retrieved actually obey the filters, using humans. Then, DataSift ranks the items using humans and then presents the results to the user. All right. Now I'm going to show you results from previous runs for this query. So this is one such result. This is just a portion of the result; there are lots more below. So the first column here refers to the rank given by DataSift for this query. The second column refers to the rank given by Google for this query, for clipart of student studying. As you can see the first four results are all fairly good. This is one student of college age sitting with books. There is no watermark and so on. Another interesting thing to note is that the item at rank four was actually ranked seventy-eight in Google. 
So DataSift managed to pull it up. Now I'm going to play a video of me scrolling down so that we look at the rest of the results. Okay, so then you have items with DataSift rank five, six and seven which are also fairly good. Then you get to items with DataSift rank minus one. So these are items that DataSift discarded during processing because it felt that they did not satisfy one or more of the filtering predicates. These items were fairly high up in the Google Search results for clipart of student studying. Right? So they were ranked three all the way to sixteen. If you go and manually inspect each of these images, you'll indeed find that they do not satisfy one or more of the filtering predicates. The typical one that is not satisfied is the "no watermark" restriction, and sometimes it's not even a clipart of a student studying. All right, so the results are fairly good for this example. Now let me try to give you an even more compelling example. So in this case I'm looking for a type of cable that connects to a socket that I took a photo of. So notice that I can add a photo as part of my query. And in this case I'm searching over the Amazon Products catalogue. So in this case DataSift does not even -- Yes? >>: Sorry. Just to understand: how would that photo be leveraged during the gather phase? Right. >> Aditya Parameswaran: I'm getting to that. >>: Okay. >> Aditya Parameswaran: Yeah. So in this case DataSift does not even know what keyword queries to issue to the Amazon Products corpus, as you rightly pointed out. So it asks the crowd for keyword query suggestions in the gather step. Right? And then it retrieves items corresponding to those keyword query suggestions, all in the gather step. Then, in the filter step, for those items retrieved it checks whether they satisfy this query or not. So the items -- And there is no rank step in this case, right, because you're not ranking based on any predicate. So the items are fairly good. All of these are indeed cables that would satisfy my query. Now I'm going to scroll down so that we can look at the results that were discarded by DataSift. So here you can see an item with DataSift rank minus one. This is in fact a Mini-B cable. It is not a printer cable. This is a scanner PC interface cable, which I don't know what that is. Then there's a micro cable. Then there's a printer which is not even a cable. And the further down you go, the more strange the results get. Right? They're not even cables beyond a point. All right. So once again DataSift does a fairly good job even for this query. To summarize, DataSift is a toolkit for efficiently executing gather-filter-rank workflows on any corpus. There are lots and lots of applications of DataSift. How do you help school teachers find appropriate articles to assign to students? How do you help shoppers find desired products? How do you help journalists find supporting data or supporting images? And there are lots of challenges in building DataSift. How do you make it flexible and extensible? How do you optimize the individual operators of gather, filter and rank? And how do you optimize across the entire workflow, which is something we haven't yet addressed. So I won't have the time to get into the detailed design of DataSift; I'm going to talk to you about the optimization of just one of the operators, specifically the filter operator. 
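A minimal sketch, in Python, of the gather-filter-rank flow just described. The helper callables (keyword_search, ask_yes_no, score_item) are hypothetical stand-ins for the corpus API and the crowd marketplace, not DataSift's actual interfaces.

def gather(keyword_search, keyword_queries, per_query_limit=50):
    # Gather candidate items by issuing keyword queries to the corpus.
    items = []
    for q in keyword_queries:
        items.extend(keyword_search(q, per_query_limit))
    return items

def filter_items(ask_yes_no, items, predicates):
    # Keep only items the crowd judges to satisfy every filtering predicate.
    return [item for item in items
            if all(ask_yes_no(item, p) for p in predicates)]

def rank_items(score_item, items, ranking_predicate):
    # Rank surviving items by a crowd-assigned score for the ranking predicate.
    return sorted(items, key=lambda item: score_item(item, ranking_predicate), reverse=True)

def gather_filter_rank(keyword_search, ask_yes_no, score_item,
                       keyword_queries, predicates, ranking_predicate, k):
    candidates = gather(keyword_search, keyword_queries)
    survivors = filter_items(ask_yes_no, candidates, predicates)
    return rank_items(score_item, survivors, ranking_predicate)[:k]

In the printer-cable example above, the keyword queries themselves would first be collected from the crowd before being handed to the gather step.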
All right, at this point are there any questions? Yes? >>: Yeah, just the two examples you just mentioned, right, so for example how do I find the right reading materials for students? With image recognition that's something anyone can do, so people can look at them and say, "That's a student." Finding out what is an appropriate book for a classroom is something that almost no one can do if you pick out sort of random people off... >> Aditya Parameswaran: True. >>: ...the Internet. So is that really an adequate task here? >> Aditya Parameswaran: True. So it is true that in some cases if the material is too specialized it might not be appropriate to use a general purpose crowd. But the thing that I had in mind, the use-case that I had in mind, was a little simpler. Let's say I want to assign articles on global warming, and I want articles that are well written, that have, let's say, neither a liberal bias nor a conservative bias -- a very neutral stance -- that consider the pros and cons of both arguments, and that are from a reputable source. Right? So these are things that anyone can check, I think, anyone who has an English background. Of course if the task is very detailed and technical -- for instance, using DataSift to ask for related work to my publication -- that's something I can't do as of now. But I suspect once we have a better sense for skilled workers or skill sets of workers, I think we can get there eventually but not right now. Any other questions? Yeah? >>: Why do you divide your task into gather-filter? [inaudible] I thought you [inaudible] the system as just a small, general-purpose query, but you are choosing this particular division called gather-filter-rank. Tell us your examples. >> Aditya Parameswaran: So I will talk about another system briefly which does general purpose computation, but this was a specific enough task. Even though this is very specific there are lots and lots of applications that fit under this model, so it [inaudible] a detailed investigation. So one aspect that is different -- Although the other system does have filter and rank components, it does not have a gather component. The gather component is very new to this system, so I don't think there are corresponding components in the other system. But I'll get to that. Yeah? >>: My related question is, if you look at databases that are relational [inaudible]. >> Aditya Parameswaran: Yes. >>: These look sort of [inaudible] those relational [inaudible]. >> Aditya Parameswaran: [inaudible] >>: Do you have a language? Because typically the interface for databases aligns with [inaudible]... >> Aditya Parameswaran: Yes. >>: Everything is implemented using composing operators. So these are the operators you use to compose them? Is that a language interface or --? >> Aditya Parameswaran: No, in this case this is the interface. It's only gather, a sequence of filters and a rank step; that is the restricted language that I can handle. But our toolkit is flexible enough that you can plug and play these operators if you really want to, but we are not supporting queries. >>: But then it's almost like the relational [inaudible]. >> Aditya Parameswaran: But only for filtering and ranking, that's it. No complex operations like... >>: These are the set of operators. >> Aditya Parameswaran: Yeah, these are the set of operators. >>: Well the gather is like the [inaudible]. >> Aditya Parameswaran: Yes, in a sense. In a sense, yeah. 
>>: [inaudible] very interesting about all this is I mean do you sort of envision kind of a marketplace of -- I mean, in some sense both gather and filter, depending on how specialized the question is, you might be willing to pay more to get people with... >> Aditya Parameswaran: Yeah. >>: ...different levels. It's almost like a matchmaking service between people who have specialized kinds of knowledge and people who have questions that need answering. >> Aditya Parameswaran: Absolutely. Yeah, so I see this as just the first steps towards a very specialized marketplace where most of the people who are currently sort of going to day jobs are -- And this is certainly happening. A lot more people are looking for employment online, and this has certainly transitioned to skilled labor of kinds like programming, design, virtual assistant work. All of this has already moved a lot to these crowdsourcing marketplaces, not enough to replace existing companies. But people really like this service, right? I mean it's flexi-time, flexi-cost. They can choose whichever project they are interested in. It's great. And I think there will be a need in the future to optimize the use of these humans even for skilled labor, and that's precisely the point. >>: It's the ultimate outsourcing. >>: Yeah. >> Aditya Parameswaran: Yeah. >>: It's scary. [laughing] >>: Depends on who you are. >> Aditya Parameswaran: It is giving opportunities to everyone, right? Anyway, that's a [inaudible]. All right, so let me move on to filtering. Filtering is an algorithm that forms part of the core of the applications that I just mentioned as well as the other applications that I'll talk about later. It's also one of the fundamental data processing algorithms. So in filtering you have a dataset of items. I don't need to tell you this but you have a dataset of items, you have a predicate and you want to find all the items that satisfy the predicate. So in our case items could be images. The Boolean predicate could be, "Is this image a cat?" and I may want to find all the cat images in this dataset. Right? >>: The Boolean predicate is an English [inaudible]. It's an English sentence? >> Aditya Parameswaran: Yeah, it's an English sentence much like the predicates that I had earlier. Yeah. So this is not something I can automatically evaluate. That's the [inaudible]. So since I can't automatically evaluate it I need to ask humans, right, does an item satisfy this predicate or not? And since humans may make mistakes, I may need to ask multiple humans. So the question is: how many humans should I ask? When should I ask them? How should I ask them? These are the kinds of questions that come up in this part of the talk. >>: Are you... >> Aditya Parameswaran: Yes? >>: ...starting with the dataset? Is it something -- [inaudible] can be the web that searches, right? >> Aditya Parameswaran: Sure. But in my scenario I have a restricted dataset. So the way to think about this is let's say I did an initial gather step. I have a set of images that I consider as... >>: Because the reason why I'm asking is [inaudible] several queries... >> Aditya Parameswaran: Yeah. >>: ...[inaudible]. >> Aditya Parameswaran: Yeah. >>: Is this image a cat? If you could just do a Google or Bing Image search on cat. >> Aditya Parameswaran: Yeah. >>: You already get a bunch of -- So some parts of the predicates in your task, you can probably push to the gather phase if you're working off a live feed in a certain [inaudible] or something. 
>> Aditya Parameswaran: So are you suggesting that I move -- So I think what you are mentioning is the option of moving some of the predicates from the filter step to the gather step. >>: No, so... >> Aditya Parameswaran: Is that the step? >>: So you are doing crowdsourcing. >> Aditya Parameswaran: Yes. >>: And the questions you are [inaudible] fairly general. >> Aditya Parameswaran: Right. >>: So what I'm trying to understand is if [inaudible] a set of precomputed items or -- The most [inaudible] source is the web itself. Right? So if you... >> Aditya Parameswaran: Okay. >>: So if you think of the source as the web... >> Aditya Parameswaran: Yeah. >>: ...and you have some predicates in mind -- Let's say you have five predicates -- some of those could be pushed to simple search predicates with which you already filter the images from the web. >> Aditya Parameswaran: Okay. >>: It's sort of optimizing the gather, [inaudible] some of the filtering with [inaudible]. >>: In some sense I mean gather already has a filtering operator there, right? >>: Yes. >>: Exactly, yeah. >>: So gather-filter-rank... >> Aditya Parameswaran: Sure. >>: ...that is not [inaudible] sometimes. >> Aditya Parameswaran: I agree. >>: Right? >> Aditya Parameswaran: I agree. And that is one of the reasons why we haven't been able to optimize the entire workflow yet. I'm just talking about this one individual operator and trying to optimize that. There are very complex interactions between gathering and filtering and ranking, in the sense that one of the versions of the system that we are building involves gathering keyword query suggestions from the crowd, retrieving a few items for each of those query suggestions, then filtering them and then going back to the gather step to gather even more for the keyword query suggestions that did well. >>: So maybe I could rephrase the question a different way. >> Aditya Parameswaran: Yeah. >>: As a user of the system [inaudible] like a researcher, dataset of items on the web. There are all these images on the web and I want to know what are the images of a cat. >> Aditya Parameswaran: Yeah. >>: I can think of it in two ways, right? I just pose this query to the [inaudible] system. The [inaudible] system breaks it down into a gather phase. >>: Yeah. >>: Because you can't handle millions of images. It [inaudible] down to thousands of images. >> Aditya Parameswaran: Okay. >>: So it extracts something [inaudible] and poses a query to the web and gets [inaudible]. This is the first step. >> Aditya Parameswaran: Yeah. >>: And it shows the [inaudible]. Or I could have a different system [inaudible] me to [inaudible]. >> Aditya Parameswaran: Who is "you"? >>: I'm the user of DataSift. The user of DataSift has to come up with the gather predicate. >> Aditya Parameswaran: Okay. >>: So [inaudible] is on the user, right? >> Aditya Parameswaran: In... >>: It doesn't automatically generate the predicate for the gather phase. >> Aditya Parameswaran: So the toolkit is general enough that it could have both options. So one option is -- So let me go back to the dataset, right? >>: [inaudible] What if I mention nothing in gather and just say search using Google Images and just fulfill the predicates? What happens? >> Aditya Parameswaran: Say that again. >>: My gather phase [inaudible]... >> Aditya Parameswaran: The topic is empty. >>: Empty? >> Aditya Parameswaran: Yeah. >>: So search all Booleans [inaudible]... >> Aditya Parameswaran: Right. >>: ...[inaudible]. >> Aditya Parameswaran: Right. 
>>: And then I say image of a cat. >> Aditya Parameswaran: Right. >>: What will happen? >> Aditya Parameswaran: So as a filtering predicate. So there are different versions of the system. One version of the system takes the entire query and asks the crowd for keyword query suggestions. >>: It will ask the crowd for keyword... >> Aditya Parameswaran: Keyword query suggestions. So that will be used in the gather step to retrieve initial items. Then, you will filter those items based on the predicate. >>: So [inaudible]. >> Aditya Parameswaran: Of course if I use the version that uses the topic to ask for keyword query suggestions, that's obviously not going to work in this case. Yeah? >>: So just to sort of try to put this in a perspective that I can understand: one way to think about the gather step is that when you give something, it's a crude way to specify what the set is. And then, the filtering predicates are actually a validation stage. >> Aditya Parameswaran: Exactly. Exactly. Yes, perfect. >>: So the other step also has a ranking, right? For example I want cat. >> Aditya Parameswaran: Yes. >>: Now in the Google Image there are... >> Aditya Parameswaran: Yes. >>: ...[inaudible]. >> Aditya Parameswaran: Yes. >>: But [inaudible]. >> Aditya Parameswaran: Yes. >>: So how do you even decide like how many to start with in the gather phase? >> Aditya Parameswaran: Great question. So while we have not yet done anything sophisticated in that step, all that we do is take the top ten results, multiply that by a factor K and then retrieve that many results and then process them. That's all that we've done so far. So there are many ways of thinking about this question. One is that the search results are somewhat correlated with the final results, right, so beyond a point going down the search results is not a good idea. If you are searching, for instance, for let's say -- I don't know -- clipart of student studying, beyond the thousandth image you're not going to get a student studying at all. You're going to get very noisy images. >>: So you are making some assumption about the data source, that Google Images is doing a good job. >> Aditya Parameswaran: I am making some assumption about the data source. I agree. >>: But shouldn't the size of an initial set -- Okay, so here's your query. For Google Images you get three million of them. You have a budget of five dollars, so shouldn't you use that budget as a guide as well? So how big should my initial set be? >> Aditya Parameswaran: Certainly. >>: Restrict that five million down to twenty. >> Aditya Parameswaran: Certainly. >>: Because that's all you can afford to ask. >> Aditya Parameswaran: Yeah. So that is something we haven't yet done, right? The entire workflow optimization is something we haven't yet done. So right now I have some ad hoc rules that govern how I use my budget -- I mean I have a rule that says gather so many items for however many items I actually need. But that is a great point, yes. Overall that's what I need to do. I need to think about how much I'm spending in the gather step, how much I'm going to spend in the filter step. And the set of items constantly shrinks as you go from the gather step to the filter step to the rank step. Right? So I need to think about how much I'm spending in each of these steps. It's a very complex problem. And hopefully by just talking about filtering itself, I'll convince you that it's complex enough. All right? So should I get into filtering? >>: Yes. >> Aditya Parameswaran: All great questions. 
Please keep asking. All right, so in this part of the talk I'll focus on the tradeoff between quality and cost. I will not consider latency, although we also have results for that case. And for now I will assume that all humans have the same error rate. This is an assumption I'll get rid of later on in the talk. All right? So how do we filter? Well, we use a strategy. So this is how we visualize strategies, in a two-dimensional grid: the number of no answers gotten so far for an item along the Y axis; the number of yes answers gotten so far for an item along the X axis. At all yellow points we continue asking questions. At all blue points we stop and decide that the item has passed the filter. At all red points we stop and decide that the item has failed the filter. Okay? So this is just one example of a strategy, let me emphasize that. An item will begin at the origin. Let's say we ask a question to a human. We get a no answer; the item moves up. We ask an additional question. We get a yes answer; the item moves to the right. We ask an additional question. We get a no answer; the item moves up. And let's say I get a sequence of yes answers; we stop and decide that the item has passed the filter. All right? So the key insight here is that since I'm making the assumption that all workers are alike, the way I get to a point is not important; all that matters is that I am there. So these strategies are Markovian. And for those of you who are familiar with stochastic control, this is in fact an instance of a Markov Decision Process, so this might be familiar to some of you. So this is just one example of a strategy. Here are other strategies: always ask five questions and then take the majority. Wait until you have three yes answers or three no answers and until then keep asking questions. So those are other examples of strategies. Now let me move on to the optimization problem. This optimization problem is one of many variants. I'm given -- or I estimate via sampling using a gold standard if I have one, or approximately if I don't -- the per-question human error probability. So this is the probability that a human answers yes given that an item does not satisfy the filter, and the probability that a human answers no given that an item satisfies the filter. And I also know the a priori probability of an item satisfying or not satisfying the filter. So I know these quantities... >>: [inaudible] >> Aditya Parameswaran: Huh? >>: [inaudible] a priori probability? >> Aditya Parameswaran: So if I have a gold standard then it's easy in the sense that -- assuming the gold standard is a sample of the actual dataset -- it would be an accurate estimation. The fraction of true yes's versus true no's would be the estimate of the a priori probability. >>: So the a priori probability that any [inaudible] image is an image of a cat? >> Aditya Parameswaran: Exactly. So in the DataSift case, I estimate these quantities approximately as part of the processing. So since it's a completely unsupervised system I actually estimate these quantities while doing processing. So I do a little bit of sampling, approximate sampling, to estimate these quantities. >>: How sensitive are the results to how [inaudible]? >> Aditya Parameswaran: I haven't really checked, so I don't know how sensitive it is. But my understanding is that these strategies are fairly robust, so even if the estimates are off -- And we've done this using synthetic experiments: even if the estimates are slightly off you still get fairly good results. Yes? 
>>: So have you done anything on -- Sorry. Have you done anything to filter out what should I call sloppy users or malicious users, somebody who just clicks no on every image or yes on every image? >> Aditya Parameswaran: Yeah, so there is... >>: How do you recognize that? Or you could also seed something, seed the input set with things that you know are correct. >> Aditya Parameswaran: True. >>: Right? >> Aditya Parameswaran: Yes. Perfect. So in the DataSift case it's completely unsupervised so I can't do this, apart from sort of using the other workers', other humans', estimates to check if a given human is good or not. Right? That is a disagreement-based scheme. We also have work on dealing with data quality but it's not integrated into this current system. So right now the way I think about it -- In this part of the talk I'm assuming that all humans are alike. And since Mechanical Turk is such a, I mean, rapidly changing pool of people, that's a reasonable assumption to make because I don't have a reliable error rate estimate for people over time. Because the pool of people that I have access to rapidly changes, at any given time I won't have a worker who I've seen before. >>: So does Mechanical Turk maintain the accuracy of those users? >> Aditya Parameswaran: No. >>: [inaudible] >> Aditya Parameswaran: No. So all they maintain is the number of tasks that these users have attempted in the past and their approval rate. And the approval rate is not much good because if you do not approve their work then all the workers will boycott you. So you just approve their work typically. That's just something you do. So Mechanical Turk does not have a good reputation system. All right. So quality is an important aspect. Right now we are sidestepping quality by assuming that all workers are alike; that's one. The other -- We do take into account, if we have estimates of worker quality, that that can be taken into account while filtering by suitably down-voting, in some sense, the bad workers. And I will tell you about that later on. >>: Maybe you can take the users, the workers, through some test to make sure that, you know, they're of some decent quality. >> Aditya Parameswaran: Great. Yes, that is an option that is often used in practice. Unfortunately in DataSift, because the task is new every time a user uses my system, I'm getting a completely new task. So testing the user on something I have information about earlier is not going to help. >>: But you can probably use... >>: [inaudible]... >>: ...the results from previous tasks and give it to your system to sort of judge the quality of previous workers. >> Aditya Parameswaran: But... >>: [inaudible] >>: I mean, suppose you are doing this task for, you know, thousands of things. >> Aditya Parameswaran: Yes. >>: Someone can just do ten of those and, you know, those things could be a test. Even that would be good enough probably [inaudible]... >>: Guys, we can generate lots and lots of ideas. >> Aditya Parameswaran: Yeah, all good ideas. >>: Why don't we let you continue on what you actually did. >> Aditya Parameswaran: All right. Thank you. So, yeah, my goal is to find the strategy with minimum possible expected cost. In this case, since I'm paying the same amount for every question, the expected cost is nothing but the expected number of questions. And I want my expected error to be less than a threshold, so this is the second objective. The last constraint is that I want my strategies to be bounded. 
So I don't want to spend too much money on any single item. So what the last constraint means is that the strategies fit within the two axes and X plus Y equal to M. Okay, so I don't spend more than, say, twenty questions on any single item, which is reasonable. All right, so how do we estimate expected cost and error? So given a strategy, the overall expected cost is nothing but the sum, over the red and blue points (X, Y), of X plus Y -- which is a proxy for the cost -- times the probability of reaching (X, Y). All right? And the overall expected error is the probability of reaching a red point and the item satisfying the filter, plus the probability of reaching a blue point and the item not satisfying the filter. So these are the two ways you can go wrong. And how do I compute these probabilities? Well, I can compute them iteratively. So the probability of reaching a point is the probability of reaching the point to its left and getting a yes answer, plus the probability of reaching the point below it and getting a no answer. So I can compute these probabilities iteratively. So I now have a way of computing the expected cost and error of any strategy. So here's a naïve approach to compute the best strategy: for all strategies, evaluate cost and error and -- Yes? >>: I have [inaudible] question. Is column identical to the [inaudible]? >> Aditya Parameswaran: No, so I assume that I have my estimates already. >>: No, each picture is a [inaudible]. >> Aditya Parameswaran: Okay. >>: Each picture has a probability of satisfying the requirements [inaudible]. >> Aditya Parameswaran: No, in my case I know that each picture is either a zero or a one. I'm not given that each picture is a probability. It's not a bias. Each picture is a zero or a one. Given that an item is a zero or a one, I have probabilities of getting wrong answers. Okay, so the naïve approach: for all strategies that fit within the two axes and X plus Y equal to M, evaluate expected cost and error, and return the best one. How do I compute all strategies? That's easy. For each grid point you can assign it one of three colors, red, yellow and blue, run through all possible strategies and give the best one. Right? Of course this is exponential in the number of grid points. If you have 20 grid points, it's on the order of 3 to the 20, and in other cases it gets even worse. So it is exponential in the number of grid points, and this is not an approach we would like to take. So I have given you a naïve approach to find the best strategy, and I'll call these deterministic strategies for reasons that will become clear shortly. Computing the best strategy this way is simply not feasible. It takes too long. But the resulting strategy is fairly good. It has low monetary cost. I have another algorithm that also gives me a deterministic strategy. Once again this is exponential but it's feasible; I'm able to execute it for a fairly large M. The resulting strategy is slightly worse. It has slightly higher monetary cost. But I'm not going to talk about this algorithm either; I'm going to tell you about a different algorithm. 
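A minimal sketch, in Python, of the iterative cost-and-error computation just described, applied to a given deterministic strategy; the names are mine, not from the paper. Here e1 is the probability a human answers no when the item satisfies the filter, e0 the probability a human answers yes when it does not, and s the a priori probability that the item satisfies the filter.

def evaluate_strategy(strategy, m, s, e1, e0):
    # strategy: dict mapping grid point (yes_count, no_count) -> "continue" | "pass" | "fail".
    # reach1 = P(reach point and item satisfies filter); reach0 = P(reach point and it does not).
    reach1 = {(0, 0): s}
    reach0 = {(0, 0): 1.0 - s}
    exp_cost, exp_error = 0.0, 0.0
    # Process points in order of total answers, so predecessors are handled first.
    for total in range(m + 1):
        for x in range(total + 1):
            y = total - x
            p1, p0 = reach1.get((x, y), 0.0), reach0.get((x, y), 0.0)
            decision = strategy[(x, y)]
            if decision == "continue":
                assert total < m, "bounded strategies must stop at the x + y = m boundary"
                # A yes answer moves right; a no answer moves up.
                reach1[(x + 1, y)] = reach1.get((x + 1, y), 0.0) + p1 * (1 - e1)
                reach1[(x, y + 1)] = reach1.get((x, y + 1), 0.0) + p1 * e1
                reach0[(x + 1, y)] = reach0.get((x + 1, y), 0.0) + p0 * e0
                reach0[(x, y + 1)] = reach0.get((x, y + 1), 0.0) + p0 * (1 - e0)
            else:
                exp_cost += (x + y) * (p1 + p0)           # questions spent on this item
                exp_error += p1 if decision == "fail" else p0  # the two ways of going wrong
    return exp_cost, exp_error

For example, the "always ask five questions and take the majority" strategy from earlier corresponds to continuing at every point with x + y < 5 and passing at boundary points with more yes than no answers:

majority5 = {(x, y): ("continue" if x + y < 5 else ("pass" if x > y else "fail"))
             for x in range(6) for y in range(6) if x + y <= 5}
cost, error = evaluate_strategy(majority5, 5, s=0.5, e1=0.1, e0=0.1)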
In order to do that I need to introduce a new kind of strategy. As some of you may have guessed, the new strategy is a probabilistic strategy. So in addition to having yellow, blue and red points, I have points that are probabilistic, like this point. So with probability 0.2 you continue asking questions. With probability 0.8 you stop and return that the item has passed the filter. With probability zero you return that the item has failed the filter. All right. So these are probabilistic strategies. We have an algorithm that gives us the best probabilistic strategy in polynomial time. And since probabilistic strategies are a generalization of deterministic strategies, it is in fact the best strategy. Period. Okay? And we can get that in polynomial time. Since it is the best strategy, it has the lowest possible monetary cost. So over the next four slides I'm going to give you the key insight behind this algorithm and then tell you about the algorithm. Okay, so the key insight necessary is the insight of path conservation. So for any point you have a fractional number of paths reaching that point. And what that point does is split the paths. Some of the paths continue onward. For some of the paths you stop and return that the item either passes or fails the filter. So pictorially, let's say there are two paths coming into this point. This point decides to split the paths 50/50, so one path continues onward to ask; for one path you decide to stop. For the path that continues onward to ask an additional question, this path moves to the point above as well as the point on the right. Okay, so this is how path conservation works for a single point. Now how does path conservation work for strategies? You have one path coming into the origin. Since it is a continue point it lets the path continue onward, so one path goes to this point and to this point. Once again, since this is a continue point, it lets the paths flow onward. While this is a probabilistic point: let's say the probability is 50/50, so half a path flows onward from here. So overall you have one path ending here, one and a half ending here and half a path ending there. All right, so this is how path conservation works in strategies. Now finding the optimal strategy is easy. We simply use linear programming on the number of paths. And so you have a number of paths coming into each point; those are the variables. The only decision that needs to be made at each point is how these variables are split. Everything else is a constant multiple. So the probability of reaching a point is a constant times the number of paths reaching that point. The probability of reaching a point and the item satisfying the filter is a different constant times the number of paths. And whether you return pass or fail at a point does not depend on the number of paths. All right? So finding the optimal strategy for this scenario is easy; you just use linear programming. 
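One plausible way to set up the path-conservation linear program, as a sketch: this is a reconstruction under the assumptions above, not the paper's code, and it assumes scipy is available. Variables: for each grid point (x, y) with x + y <= m, the fraction of paths that continue, stop-and-pass, or stop-and-fail there; error_threshold is the quality constraint.

import itertools
from scipy.optimize import linprog

def optimal_probabilistic_strategy(m, s, e1, e0, error_threshold):
    points = [(x, y) for x, y in itertools.product(range(m + 1), repeat=2) if x + y <= m]
    index = {pt: i for i, pt in enumerate(points)}
    CONT, PASS, FAIL = 0, 1, 2
    nvars = 3 * len(points)

    def var(pt, kind):
        return 3 * index[pt] + kind

    # Probability of one specific path of x yes / y no answers, jointly with the item's label.
    def w1(x, y):
        return s * (1 - e1) ** x * e1 ** y            # item satisfies the filter
    def w0(x, y):
        return (1 - s) * e0 ** x * (1 - e0) ** y      # item does not satisfy

    # Objective: expected number of questions, summed over terminating paths.
    c = [0.0] * nvars
    for (x, y) in points:
        for kind in (PASS, FAIL):
            c[var((x, y), kind)] = (x + y) * (w1(x, y) + w0(x, y))

    # Path conservation: continue + pass + fail = paths flowing in (1 at the origin).
    A_eq, b_eq = [], []
    for (x, y) in points:
        row = [0.0] * nvars
        for kind in (CONT, PASS, FAIL):
            row[var((x, y), kind)] = 1.0
        if (x - 1, y) in index:
            row[var((x - 1, y), CONT)] -= 1.0
        if (x, y - 1) in index:
            row[var((x, y - 1), CONT)] -= 1.0
        A_eq.append(row)
        b_eq.append(1.0 if (x, y) == (0, 0) else 0.0)

    # Expected error <= threshold: failing a satisfying item, or passing a non-satisfying one.
    err_row = [0.0] * nvars
    for (x, y) in points:
        err_row[var((x, y), FAIL)] = w1(x, y)
        err_row[var((x, y), PASS)] = w0(x, y)

    # Bounded strategies: no continuing past the x + y = m boundary.
    bounds = [(0, None)] * nvars
    for (x, y) in points:
        if x + y == m:
            bounds[var((x, y), CONT)] = (0, 0)

    res = linprog(c, A_ub=[err_row], b_ub=[error_threshold],
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res  # res.x holds the per-point continue/pass/fail path splits

The key point the sketch relies on is the one in the talk: the probability of reaching a point, jointly with the item's label, is a constant (w1 or w0) times the number of paths, so both the objective and the error constraint are linear in the path variables.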
Now I'm sure you have thought of many issues with the current simple model. We have generalizations that can hopefully handle all the issues that you've thought of. All right? So let me pick a few of them to explain further. All right, so the first generalization is that of multiple answers. So instead of having a Boolean predicate, yes or no, say you want to categorize an image as being either a dog image, a pig image or a cat image, or you want to rate an item as being either 0 out of 5, 1 out of 5, all the way up to 5 out of 5. In this case we simply record the state as the number of answers of each category that I've gotten so far. Once again I can use path conservation and linear programming to find the best strategy. The second generalization is multiple filters. So, so far we considered a single filter. What if we have a Boolean combination of multiple independent filters, like in my DataSift example? In this case we simply record the state as the number of yes and no answers for each of those filters, and at any point you can choose to ask about any one of those filters or you can stop and return that the Boolean predicate is either satisfied or not satisfied. Then the last generalization is that of difficulty. So far we assumed that all items are equally easy or equally difficult, so they all had the same error rates. What if they're not? What if there is a hidden element of difficulty? We capture that using a latent difficulty variable, and the error rate for each item is dependent on that latent difficulty variable. Once again, we can capture that in our current setup. Now let me move on to a harder generalization. So this is the generalization of worker abilities. So let's say I have three items whose actual scores are 0, 1 and 0. Worker 1, who is a very good worker, decides to answer 0, 1 and 0 for these three items. Worker 2 decides to answer 1, 1 and 1 for each of the three items, so he's a fairly poor worker. Worker 3 is adversarial, so he flips a bit for each of the items. So he's a fairly bad worker but he gives us a lot of useful information. We can just flip his bit. Anyway, so how do we handle such a case? We are losing a lot of key information by assuming that all workers are alike. So we can reuse the trick from multiple filters. We can certainly record the number of yes and no answers corresponding to each of the workers, and this certainly works. Unfortunately if we have many workers with varying abilities, we have an exponential number of grid points and, therefore, our approach does not scale. All right? So over the next three or four slides I'm going to tell you about a new representation that helps us solve this problem. Any questions at this point? All right. So instead of recording the number of yes answers and the number of no answers gotten so far, we record the posterior probability of an item satisfying the filter given the answers that you've seen so far along the Y axis, and the cost that you've spent so far along the X axis. Okay? So now to make it clearer I'm going to map the points from the previous representation to the new representation. So the point at the origin maps precisely to the a priori probability of an item satisfying the filter and cost equal to zero. All right? These two points map to points above and below that point at cost equal to one. Right? And the remaining points would map to their respective points in the new representation. Now as an approximation, I'm going to discretize the posterior probability of an item satisfying the filter given the current answers into one of a small number of buckets. And as a result multiple points in the old representation may map to the same point in the new representation. All right? Notice that I can discretize it as finely as I want, as finely as my application needs. So if we have many workers with varying abilities, we can once again map that 2N-dimensional representation to this two-dimensional representation. And as an interesting property: as we reduce the size of the discretization, make it smaller and smaller, the optimal strategy in the new representation tends to the optimal strategy in the old, more expensive representation. All right? So what changes in the new representation? Well, instead of starting at the origin, you now start at the a priori probability of an item satisfying the filter with one path entering the strategy at that point. If all workers have the same error rates, you have two possible transitions, one above and one below, both on spending one unit of cost. So you always transition to the right. If you have N workers with varying abilities, you have order of N transitions. So the size of each linear equation scales up by order of N. And once again everything else works. You can use the path conservation property and linear programming to find the optimal strategy. 
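A small sketch of the state update in the new representation: a Bayes update of the posterior probability that the item satisfies the filter after each answer, discretized into buckets. The worker error rates fn (probability of answering no when the item satisfies) and fp (probability of answering yes when it does not) are assumed inputs, not estimates from the talk.

def posterior_update(prior, answer, fn, fp):
    # Bayes update of P(item satisfies filter) after one worker's yes/no answer.
    if answer == "yes":
        like1, like0 = 1 - fn, fp      # likelihood of a "yes" under each label
    else:
        like1, like0 = fn, 1 - fp
    numer = prior * like1
    return numer / (numer + (1 - prior) * like0)

def bucket(prob, n_buckets=20):
    # Discretize the posterior into one of n_buckets states along the Y axis.
    return min(int(prob * n_buckets), n_buckets - 1)

# Example: start at the prior, apply a "yes" from a careful worker and a "no"
# from a sloppier one; the strategy's state is then (bucket, cost spent so far).
p = 0.5
p = posterior_update(p, "yes", fn=0.05, fp=0.05)
p = posterior_update(p, "no",  fn=0.30, fp=0.30)
state = (bucket(p), 2)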
So let me quickly tell you about one more generalization that works well in the new representation, then I'll move on to experiments. So the other generalization that works well in the new representation is the generalization of a priori scores. So what if I have a machine learning algorithm that provides, for every item, a probability estimate of that item satisfying the filter? For instance I may have a dog classifier, and the probability estimate may be proportional to the distance from the classifier boundary, as well as which side of the boundary the item lies on. This is easy to use as part of my strategy computation. Let's say I have 50 percent of items with probability 0.6 and 50 percent of items with probability 0.4. I simply start half a path at 0.6 and half a path at 0.4, and the strategy computation proceeds as before. When running a strategy on an item, the item will begin at its a priori score. And this a priori score would sort of capture the intuition for when the input dataset already comes with a ranked list of results. Right? We could certainly help in that case as well. All right, so [inaudible] about a number of generalizations. We have other generalizations that I did not have time to cover. Yes? >>: Yeah, so I have like one question on the [inaudible]. So in this case you studied the [inaudible]. >> Aditya Parameswaran: Yeah. >>: But in practice I would argue for your application. A much more natural operator is [inaudible]. I mean I want to pick 50 images [inaudible]. Right? If my ultimate goal is to pick 50 images [inaudible], it might not be optimal for me to go through each image and get it graded, right? It's much better for me to consider an image, and if an image is marked yes I want to pick those images. But if an image is starting to have some variation, like variance in the marking, it's probably not useful because I want to focus on images [inaudible]. So it seems that the nature of the problem will fundamentally change if you incorporate [inaudible]... >> Aditya Parameswaran: Certainly. Certainly. So that's another problem we have studied. And the key insight in that scenario, when you want a fixed number of items from a dataset that satisfy the predicate, is that as soon as an item falls below the average item in the dataset, you would rather pick the average item of the dataset. That's an intuition that you had as well, and we have a separate paper on that. I'm not going to be focusing on that in this talk. In addition to systems like DataSift, filtering also appears in lots of natural scenarios -- companies do this all the time -- things like content moderation. A lot of companies have a content moderation phase before user-uploaded images go on the live site. They have a content moderation phase where they use crowdsourcing services. In that case you need to go and manually check every single image. And in the second application that I'm going to tell you about, you also need to go and manually inspect every single item. All right. 
So finding k items that satisfy the predicate, that's a natural algorithm that we've studied. Yeah? So now I'm going to tell you about experiments. I'll use that as an excuse to tell you about the second application that we've been studying, that is MOOCs. So I'm sure you've heard of MOOCs, massive open online courses. They're very trendy. There's in fact even a poster of the movie The Blob which has been photoshopped to read The MOOC, which I thought was quite cute. MOOCs are revolutionizing the world of education. There are hundreds and hundreds of courses, each being taken by thousands and thousands of students. There are lots of courses that require subjective evaluation, courses like psychology, sociology, literature, HCI and so on. And there's no way TAs can go and evaluate all the assignments in all of these courses. So what we need, therefore, is peer evaluation. So peer evaluation is crowdsourcing but with an important twist. The important twist is that the evaluators are also the people being evaluated. Okay, so now the key question is: how do you assign evaluations to submissions so that you can accurately determine the actual grade of each submission? And notice these were the images that DataSift gave me for my initial example. Okay, so deciding whether or not to get additional evaluations for each submission is a generalization of the filtering that I considered, where I want to rate an item as being either 0 out of 5, 1 out of 5, all the way up to 5 out of 5. So we are very lucky to have the dataset from one of the early MOOCs offered at Stanford. This is the Stanford HCI course. In this case you have 1500 students with 5 assignments, each having 5 parts. These are graded by random peers whose error rates we know, because we've had them go and evaluate assignments for which we know the true grade. Okay, so we know their error rates. And our goal is to study how much we can reduce error for fixed cost, or vice versa. So here is one sample result. I'm plotting the average error. In this case the average error is the average distance from the actual grade. And remember actual grades are between 0 and 5. And along the X axis I have the cost; in this case the cost is the average number of evaluations for each submission. And I'm plotting three separate algorithms. The first is the median algorithm that requests a fixed number of evaluations for each submission and then takes the median. The second is the one-class algorithm that assumes that all workers are alike -- have the same error rates -- and uses the old representation. And the third is the two-class algorithm that puts workers into two buckets based on their variance, high variance and low variance workers, and uses the new representation. So for now let me focus on the median algorithm at evaluations equals five. So in that case the median algorithm has an error of 0.3. So this is in fact the heuristic that is currently being used in the Coursera system for a range of courses like psychology, sociology and so on. So we can get to the same error using just 60 percent of the cost using the one-class algorithm, and just 40 percent of the cost using the two-class algorithm. From the perspective of error, if I fix the cost at three, I can reduce the error by 40 percent if I use the one-class algorithm and by 60 percent if I use the two-class algorithm. So either way I can significantly reduce both cost and error using our strategies. All right? So at this point I'm happy to take questions because this is -- All right. Moving on. 
So let me tell you about other work in the crowdsourcing space that I've worked on, other research that I've done, and then conclude by talking about open problems. Yes? >>: So [inaudible] filtering: so if you have multiple filters like in your example you showed, like four or five filters [inaudible]... >> Aditya Parameswaran: Right. >>: How do you handle them? Do you ask [inaudible] questions for each of them or... >> Aditya Parameswaran: Yes. Yeah. So currently the way we handle them in the filtering operator is by having a separate question for each of those filters. >>: Okay, but you could also consider kind of combinations and... >> Aditya Parameswaran: True. True. The reason why we decided to go with separate questions for each of the filters is because it's not very clear -- humans are more likely to make mistakes with a combined question because it's not clear what question they are answering. If it is a single-unit question, it's much clearer what they are answering. With something like "does it satisfy this and this and this and this?" they might say no even if it does satisfy, or the other way around. But, yeah, good point. Okay, so we have studied other aspects of data processing in addition to filtering: finding the best item out of a set of items; categorizing an item into a taxonomy of concepts; identifying a good classifier for imbalanced datasets; also the search problem, which is the one that you mentioned, finding k items that satisfy a predicate. Determining the optimal set of questions to ask humans in a lot of these cases is NP-hard even for very simple error models; therefore, we need to resort to approximation algorithms. And recently we've started looking into some of the data quality issues as well, which are common to all of these algorithms. Let me move on to Deco. So DataSift is, in my mind, an information retrieval-like system powered by the crowd. Deco, on the other hand, is a database system that's powered by the crowd. So I don't need to tell you this but database systems are very good at answering declarative queries over stored relational data. But what if the data is missing? What if I don't have the data? So Deco can actually tap into the tiny databases that exist in people's heads, so it can answer declarative queries over stored relational data as well as data computed on the fly by the crowd. So if you ask a query like this, asking for the cuisine of Bytes Café at Stanford, Deco will gather the fact that the cuisine of Bytes is French and return that as a result for the query. So -- Yeah? >>: This system works perfectly as something [inaudible]. >> Aditya Parameswaran: Not the gather step. >>: I thought this is something that we're gathering. >> Aditya Parameswaran: It is gathering missing data. The keyword query suggestions -- I mean, it would take a lot of sort of mangling for Deco to fit under DataSift. But it is true that this is a much more general purpose system than DataSift. All right? So here are the key elements of Deco's design. It has a principled and general data model and query language. You have user-configurable fetch rules for gathering data from the crowd. This is sort of like access methods, if you will. User-configurable resolution rules for removing mistakes or resolving inconsistencies from data gathered by the crowd. One such resolution rule could be the filter operator. Due to the three-way tradeoff between latency, cost and quality, we need to completely revisit query processing and optimization in this scenario. 
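Not Deco's actual API, just an illustrative sketch of the fetch-rule and resolution-rule idea: answer a query over stored tuples, and fill in a missing value by fetching it from the crowd several times and resolving disagreements with a majority vote. The ask_crowd callable is a hypothetical stand-in for the marketplace.

from collections import Counter

def majority_resolve(answers):
    # Resolution rule: majority vote over the crowd's answers.
    return Counter(answers).most_common(1)[0][0]

def fetch_rule(ask_crowd, restaurant, attribute, n_asks=3):
    # Fetch rule: ask the crowd n_asks times for a missing attribute value.
    return [ask_crowd(f"What is the {attribute} of {restaurant}?") for _ in range(n_asks)]

def query_cuisine(stored, restaurant, ask_crowd):
    row = stored.setdefault(restaurant, {"cuisine": None})
    if row["cuisine"] is None:  # missing value: compute it on the fly via the crowd
        row["cuisine"] = majority_resolve(fetch_rule(ask_crowd, restaurant, "cuisine"))
    return row["cuisine"]

# Usage with a stand-in crowd that always answers "French":
stored = {"Bytes Cafe": {"cuisine": None}}
print(query_cuisine(stored, "Bytes Cafe", lambda question: "French"))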
So we have a working prototype which was developed by a colleague and myself at Stanford. And this also has a web interface. So the prototype supports Deco's query language, data model, query processing and optimization, as well as a web interface where you can post queries, visualize query plans and get results. All right? So let me move on to data extraction. Let's say I have a website like Amazon, and I want to extract an attribute like price from all the web pages on Amazon. I can ask humans to provide pointers to where the attribute is present on a given web page. Then, I can extract from all the pages on that site. But what if the web page is modified? So my pointers are no longer valid and I may end up extracting incorrect information. Like in this case, I may end up extracting the fact that the Kindle costs ten dollars. So what do I do in such a case? We built a robust wrapper toolkit that can reverse engineer where the pointers have gone in the modified versions of the website, so you can continue to extract from modified pages accurately. So you can significantly reduce the cost of having humans provide pointers once again. So our robust wrapper toolkit had some very nice theoretical guarantees, and over an internship I deployed this in Yahoo's internal information extraction pipeline. Okay, so there's lots of related work that we build on in the crowdsourcing space, work on workflows, games and apps. Whenever I give talks, I typically get questions on the first four topics, although my work is more similar to the last two topics. Deco and DataSift are similar to the recent work happening around the same time on CrowdDB and Qurk. And [inaudible] there's been development of a number of other algorithms, sorts and joins, clustering and so on. Okay, now let me tell you about some of the other research I've done. I've also worked on course recommendations for a course recommendation site called CourseRank. Course recommendations involve a number of interesting and challenging aspects. So you need to deal with things like temporality, because courses are typically taken in sequence. You need to deal with requirements, because courses need to be recommended that are not just interesting but also help the student meet graduation requirements. My course recommendation engine had some nice theoretical guarantees, and this was deployed into CourseRank. CourseRank was spun off as a startup by these four undergrads and deployed at about 500 universities. And I think a year ago it was purchased by a company called [inaudible].com. All right, so... >>: So you don't need a job, right? [laughing] >> Aditya Parameswaran: There's this T-shirt, right, that says, "My friends had a startup and all I got were the lousy research papers," right? Yeah, so I've worked on human-powered data management and recommendation systems. In addition I've also worked on information extraction and search, but I won't have time to cover that in this talk. So in all of my work I've followed a sort of end-to-end approach to research. I model scenarios conceptually, starting from simple error models and then generalizing, like in the filtering case. I formulate optimization questions that make sense in the real world. I find optimized solutions using techniques from optimization, inference, approximation algorithms and so on. And I build systems with these solutions, systems like DataSift, Deco, the robust wrapper toolkit and so on. Of course in research it's never really a linear path; there are lots and lots of iterations. 
But I intend to continue using this end-to-end approach in research. All right? So I think human-powered data management is only going to get more and more important in the future. There are more and more people looking for employment online, so there's a need to manage and optimize the interaction of this giant pool of people who are interacting online in a seamless manner. And of course more and more data is being accumulated. However, many fundamental issues in crowdsourcing remain. Issues like: it takes too long, sometimes the work is badly specified, sometimes workers are error-prone, sometimes humans don't like the tasks that we give them, and sometimes it costs too much. So I have initial angles of attack for all of these issues. Let me pick a few of them to explain further. So latency can be addressed by having systems produce partial results as they do their computation. But this requires revisiting the computation models of systems and algorithms. Given two algorithms, how do we pick the one that produces interesting partial results faster? So to deal with poorly specified work, how can we use a crowd to decompose -- So one more point about the eager computation: this is related to some prior work in the database community on online aggregation. To deal with poorly specified work, how can we use the crowd to decompose a query into a workflow? What should the intermediate representation be? How can we verify the correctness of this intermediate representation? To deal with error-prone workers, how can we monitor their performance and see that their performance doesn't start to drop? How can we be sure that our estimate of their performance is correct? And how and when should we provide feedback to workers that they're doing a good job or a bad job? All right, so the steps that I'm arguing for go beyond looking at systems and algorithms to the other steps in the pipeline: the interaction with humans, as well as the interaction with platforms. So once we solve some of the fundamental issues in crowdsourcing, I think there is no end to what we can do. There are many, many hard data management challenges that could benefit by plugging in humans as a component. And of course designing some of these systems would bring about a whole range of additional challenges as well. For instance, how can we impact interactive analytics using humans? Can humans help formulate queries? Can humans help visualize query results? How can we build better consumer-facing applications powered by humans? By combining human and computer expertise, can I build a newspaper, a personalized newspaper that beats Google News, for instance? Can I build human-powered recommendation systems? How can I impact data integration, a problem database people have been working on for decades now, using humans? Overall I think there are lots of interesting problems in redesigning data management systems by combining the best of humans and algorithms. At this point I'd like to mention that a lot of the work that I've done in my PhD is in collaboration with a number of collaborators, both at Stanford and outside Stanford. In particular I'd like to call out my advisor, Hector Garcia-Molina, my unofficial co-advisor Jennifer [inaudible], as well as frequent collaborator [inaudible]. At this point I'm happy to take questions. [applause] >>: [inaudible] >> Aditya Parameswaran: You guys are there. You guys are there. >>: Who's the collaborator [inaudible]? >> Aditya Parameswaran: That is Ming Han. I couldn't find a photo of him. 
[multiple comments simultaneously] >>: Collaborators: humans and sometimes non-humans? >> Aditya Parameswaran: Yeah, I should have mentioned the crowd of Mechanical Turk workers. [inaudible] Any other questions? All right, thank you so much for attending. [applause]