>>: Each year, Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors and leading academics, and makes videos of these lectures freely available. >>: All right. Hello, and welcome. We're pleased to welcome Peter Organisciak -- do you want to say it? >> Peter Organisciak: Organisciak. >>: Organisciak. The way that you can remember it, though, is it sounds very much like organized or organizing, and it's very appropriate for a person who does work in the information sciences space. So he's here from the University of Illinois at Urbana-Champaign, where he works with Miles Efron, and he's going to be here all summer in Cambridge, Mass., working with Adam Kalai and Susan Dumais and me this summer, and so he's out here to brainstorm, and then will be with us from afar over the summer. So, thank you, Peter. >> Peter Organisciak: Thank you. So, first of all, I'm very grateful to be here. I'm enamored with the setup that you guys have here. This is my second day, and I'm having a lot of fun. So, today, I'm going to talk to you about crowdsourcing, and I expect that some of you have some expertise in crowdsourcing. But I'll talk about it from a couple of angles that I think you'll find interesting. Essentially, I'll look at it in two ways. In the first part, I'll talk about the motivations of users who contribute to crowdsourcing -- a lot of things that you may have inferred from your past experience, but I'll hopefully help you think about them in a more schematic way -- and that follows from my master's thesis. And then, secondly, I'll talk about implementation from the system end, namely, discussing ways to think about the reliability of the contributions that you get from crowdsourced information. So, yes, just a brief overview. I'm sure many of you are already familiar, but crowdsourcing is defined as the act of taking a job traditionally performed by a designated agent and outsourcing it to an undefined, generally large group of people in the form of an open call. The term itself was coined in a Wired article in 2006, and if anyone's ever read that article, it actually gave the definition a much narrower scope. And, as Jeff Howe, the author of that article, pointed out only a few days later, when people heard the word "crowdsourcing," they sort of coopted it and ran with it as something much broader than what he initially defined it as. So it became this umbrella term, and as an umbrella term, it encompasses a number of other things that we're already familiar with. So crowdsourcing itself isn't new; just the word to talk about the phenomenon is new. Some of us may be familiar with Commons-Based Peer Production by Yochai Benkler, which he introduced in 2002. Then, in 2006, he wrote a good book called The Wealth of Networks. There was a popular book in 2004 by a journalist, The Wisdom of Crowds, where James Surowiecki made the argument for collective intelligence in aggregate, and it was sort of a counterpoint to our traditional understanding of crowds as mobs, as sort of irrational beings. And then, of course, Luis von Ahn had the ESP Game in 2004, and he formulated that around the idea of human computation, so using humans in a way that you would traditionally use computing cycles, because there are certain things that humans can do that computers simply can't.
So another way of thinking about that term of crowdsourcing is as a verb that tries to capture the idea of attempting to utilize the wisdom of the crowds, so trying to go out to people with a task and hopefully getting something intelligent back. And, since it's a verb, it doesn't necessarily -- it doesn't inherently suggest whether you are successful in crowdsourcing. To crowdsource is to push your project out, so, like, look what I crowdsourced with my friends. When I started a couple of years ago in this area, it wasn't always clear why anybody would ever contribute to some of the crazy projects that were coming out. People were excited, but they weren't exactly sure how you actually create something that gives as much as it takes, and at the same time, it was becoming apparent that the most successful projects seemed to find success almost by happenstance. So today, or at least in the first half of this talk, I'll try to talk about that issue of identifying whether a task fits a crowd and how you get that crowd to come over to your side. Oh, yeah, and I have a bunch of photos from Flickr Commons. These are all public-domain photos. So here's another good quote from Jeff Howe: "We know crowdsourcing exists because we've observed it in the wild. However, it's proven difficult to breed in captivity." And yet that's changing. In the past few years, we've been slowly filling in the gaps in our knowledge of online crowds, but the question of what successful crowdsourcing sites do is an interesting one, because it seems like many sites have almost stumbled upon the formula. So what I wanted to do is actually learn bottom-up from existing sites that have approached crowdsourcing and been successful with it, and I wanted to figure out exactly what they did, but there's a problem. Like I said, in 2006 crowdsourcing had just been defined as a term, at least, and the definition was being coopted by the public, so how do I sample crowdsourcing sites if we're sort of negotiating the meaning of this term? If I set a definition for the sample and then try to sample around that, my definition would likely be inadequate, and then I'd also run into the problem of methodically finding websites that fit that definition, right? So if only there was an online crowd that semantically described websites -- which, of course, there was. Some of you may be familiar with Delicious, which was a website where people saved their bookmarks, and as they saved their bookmarks, they typed in tags that describe what each website is about. So I thought, if I'm looking for the sort of publicly defined area of crowdsourcing, why not look at the most common websites tagged with the word "crowdsourcing"? So I looked at that. I collected 300 sites. I looked at every single one of them -- and if there was ever an argument to split up human labor into many tasks, it's looking at 300 websites over the course of a number of weeks. I visited each one, wrote down keywords that described their methodology and structure, then, in subsequent passes, standardized them and synthesized them into higher-level concepts. So what I ended up with is a proposed list of 11 nonexclusive categories, and for my use, I wanted to get a breadth of what types of crowdsourcing sites are out there. So this isn't the definitive taxonomy, but it is one that covered every single one of those sites, and then I was able to sample off of it.
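As a concrete illustration of that sampling step, here is a minimal sketch of pulling the most commonly bookmarked sites for a given tag from a Delicious-style export. The file name, column layout, and tag format are assumptions for illustration only; the talk does not describe how the 300 sites were actually collected.

```python
# Minimal sketch: rank URLs by how many Delicious-style bookmarks tag them
# with "crowdsourcing". Assumes a CSV export with columns: url, tags
# (tags space-separated) -- an illustrative layout, not the real data format.
import csv
from collections import Counter

def top_tagged_sites(path, tag="crowdsourcing", n=300):
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            tags = row["tags"].lower().split()
            if tag in tags:
                counts[row["url"]] += 1   # one bookmark = one "vote" for the site
    return counts.most_common(n)          # the n most-bookmarked candidates

if __name__ == "__main__":
    for url, n_bookmarks in top_tagged_sites("bookmarks.csv")[:10]:
        print(n_bookmarks, url)
```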
There were sites that approached an encoding methodology, so perception-based tasks, creation, idea exchange, knowledge aggregation, opinion aggregation, skills aggregation, and then there were also sites that took a commercial approach in their structure, some sites that offered a platform to make it easy for anybody to go onto that site and connect to a lot of people. The more sort of ludic sites, so gaming, playful sites, group empowerment sites and just-for-fun sort of sites, and I'll focus on a couple of these. This was the distribution, at least on Delicious, of the categories, but that chart is dependent on the biases of whoever tags on Delicious, so the chart itself isn't too important. But a couple that are interesting to us here -- encoding websites are sites that approach the crowd from the idea of - from the human computation angle, thinking there are certain things that require human abstraction, human reasoning to complete. If only there was some way that, by bringing lots of people together, we can make it easier to complete these tasks, because by their definition they're things that computers can't do. So here's an example that I like. I also like it because I found this Scandinavian classroom image, which sort of matches the Finnish word game. But this is a game for digitizing Finnish medieval transcripts. And what happens is, these little moles would hold up words with the OCR transcription and you say is this correct or incorrect. And then, as you can see with some of the letters, it's a bit of a difficult task. To a computer, I don't know how easy it would be to teach it that As look like that, right? So sometimes it works, sometimes it doesn't. Then, they have another game where the ones that people vote as being incorrect, then you type them. The reason it's a game is because you're rushing to do it as quickly as possible and you graduate to higher levels, where you do more difficult tasks, so more OCR problematic words -- I'm sorry, more uncertain words. Here's another example of a perception-based task. Once again, transcription. If you look at things like handwriting, that's another difficult problem, where OCR, you may find a lot of errors. It's certainly a problem we can tackle, but just with a lot of trouble. So this is from the Bentham archives. Jeremy Bentham was a philosopher, and what they found was, lots of people in Britain liked to contribute to this, because they liked learning about some of their history, especially retired folks and actually scholars on sabbatical. But one of the things about Jeremy Bentham's handwriting is, as he got older, it got messier. This is still pretty clean. As it got older, it was really difficult to make sense of it, and actually, I tried contributing to this one time. It was terrifying. Another one that's useful for us to know about, knowledge aggregation, so projects that bring together what we know or what we've experienced. I saw a number of sites like that. Obviously, we all know sites like Stack Exchange and found other sites that, relative to your position -- I found a mobile site, for example, that if you find a pothole or something wrong with your city, you could quickly notify the proper authorities, and then it mapped it. So, in aggregate, you could see which parts of the city were being neglected. It made the city more accountable. They couldn't just say, "We're going to look at the wealthy neighborhoods." Skills aggregation, pretty self-explanatory. I just like this image, to be honest. 
Auto polo apparently was a thing. Creation -- so that's where people create things from scratch. There was one really cool project that I looked at called Star Wars Uncut, which split up the first Star Wars movie into 15-second bits, and people would adopt these 15-second bits and refilm them in their own way. They would animate them or use Claymation, and they remade this full movie from that. Actually, I think they're working on an Empire Strikes Back one now. I saw a trailer a few weeks ago. So, yes, what did I do with this? Out of those 300 sites, I did a purposive sampling of a number of sites that well represented each category, and then, looking at those sites -- I won't get into details for time's sake, but I performed a content analysis on the points of interaction in each of those sites. And, from there, I went and I interviewed people to try to find out what points of interaction actually mattered to users. On the other end, I ended up with a number of primary and secondary motivators that seemed to suggest things that you either need to have or that is good to have in a crowdsourcing project to compel people to contribute. I'll read these off, but I'll look at them closer in a moment. So there was interest in the topic, ease of entry and of participation, altruism and a meaningful contribution, sincerity, appeal to knowledge and, of course, money. So let's get money out of the way. Money is the most reliable motivator, as you can expect. If you have nothing else, money can find you contributors. However, it also has a tendency to overwhelm other motivators. It can also be a bottleneck, so you're limited to how much you can pay people. It's not completely problematic to pay people for tasks. Your own Duncan Watts did one of my favorite studies a few years ago with Winter Mason, I believe, that show that there are still intrinsic motivators when you pay people. But it's definitely more difficult to get people to get excited about a task, even if it's easier to find people. Most of us know Mechanical Turk. Everybody knows? We know Mechanical Turk. Here's another example, Quirky, which is a collaborative product-creation website, where people submit ideas for a product that should exist, and then the community develops that product together. They vote on it, they come up with ideas on the specifics of how it will work, what the design should be like. And every activity gives you a small percentage of ownership in that product, so then you actually get profits if that product goes to market. This was one of the sites that I looked at, and the people I interviewed seemed to think that it wasn't just about the money. You see a lot of people just excited to create something or to actually have their name on something that goes to market. I'm sure many of us can relate to that. One of my professors, Michael Twidale, likens it to Tom Sawyer painting a fence, where he has to paint a fence while the other kids are playing, and one of his friends comes by and he tries to make fun of him. And Tom Sawyer's like, "No, I love this. This is fun." Another kid comes by and he's like, "No, I think this is great." And eventually he convinces another boy to take it over, because that boy was led to believe that it's a fun thing, and eventually everybody wants to paint the fence. 
And Mark Twain very nicely summarizes it, saying, "If Tom Sawyer had been a great and wise philosopher, like the writer of this book, he would have comprehended that work consists of whatever a body is obliged to do and that play consists of whatever a body is not obliged to do." Here was another really important motivator, interest in the topic. And you guys can probably guess that, but it kept coming up over and over again, so much so that I feel like the best areas for crowdsourcing are just areas of popular amateur communities that haven't gone online yet. You could probably come up with a great quilting crowdsourcing site. Star Wars Uncut, people participated in that because they like Star Wars, and there's a couple other examples that I'll talk about later, where this played a big factor. Ease of entry, ease of participation. This was cited as being important by every person I interviewed -- sorry, for every example and by most of the people I interviewed. I chose this photo, just because it's literally of Easy Street. Altruism and meaningful contribution. I didn't have a photo of a High Street -- or a High Road, sorry, but people like helping out with things. If they think you're being genuine in what you're performing, people can jump onto that. So here's a good quote from a Library of Congress report about Flickr Commons, where they mention that it was a very successful project for Library of Congress to partner up with Flickr. When they first announced it, they said, "Help improve cultural heritage," and that seemed to have caught a nerve with people. Galaxy Zoo is another example, which is where people encode galaxies, many that have never been seen before by human eyes. Appeal to knowledge, opinion, sincerity, and then there are a couple other secondary motivators. I won't go through all of these. These I found, while primary motivators, you need at least one of those in a site to succeed. These are things that seemed to encourage people to contribute more. So if they're already on a website, this will push them a bit further. This was sort of surprising for me, indicators of progress and reputation. So those are the gamification mechanics. I was surprised that it wasn't more important for people, but lots of people named it as like a secondary, tangential thing. I interviewed a guy who was a heavy FourSquare user, and I thought, "He's going to tell me about gamification," and even he said, "I like FourSquare, but I like it because I know what restaurants I went to with my girlfriend, and I know where my friends are. And then the points are good for keeping me more diverse when I go out." Cred, another one where I just liked the photo. I think it matches there. Just some feedback. Okay, so let's move on. That's fine and good, but how does it relate to us? So on this part, I want to talk specifically about how crowdsourcing relates to my area in information retrieval, but also the bigger problem of understanding the reliability of people when you're dealing with crowdsourced data. So here's the problem. Classification is a tiring task, and it's difficult to use on large scales. So Galaxy Zoo, for example, they classified something like 60 million galaxies. Before Galaxy Zoo, there was a doctoral student, and he did it day and night for months on end, and he did something like 20,000, and that was far beyond what anybody else had ever done in terms of galaxy classification. 
We just don't have the time or the sanity to perform classification on large scales, and in information retrieval, this is a problem because we need -- oftentimes, we rely on nicely classified data for evaluation. And, as a result, we're often evaluating on the same TREC data from the Text Retrieval Conference, and it's hard to do diverse work without the big overhead -- sorry, it's hard to do work off the beaten path without the big overhead of creating something that you can evaluate by. Another Microsoft scholar, I think, Omar Alonso, has argued for paid worker crowdsourcing as a way to overcome this tediousness and the problems of expanding to large numbers of people. So here is a question I was interested in. When you're not sure about a rater, how do you determine whether they're reliable, and how can you do so in a way that is fair to them? So if we crowdsource, if we have random online users contribute data that we're then building research off of, how do we know that we should be building research off of that? And, more specifically, how do we tell our reviewers that it's okay? Because, oftentimes, these people -- for example, workers on Mechanical Turk -- are self-selected and semi-anonymous. So there can be people that show up, and they might be malicious. How do you know? In my specific case, I was working on a project that was improving retrieval for metadata records of varying lengths, and I collected a lot of relevance judgments, and they seemed good, but I wasn't sure. Partially, the kappa scores were low, which is a measure of agreement. So I needed to figure out, are there problem people in there and can I correct for them? I looked at them from three different angles. First, I asked whether the amount of time that somebody spends on a contribution -- so, in my case, making a classification -- reflects the quality of that contribution. Secondly, I asked, do contributors grow more reliable over time? So the more you do a task, do you get better at it? Finally, I tried to look at whether your agreement or disagreement with other raters reflects your overall quality as a rater. And, if it does, which by all accounts we should expect it to, how do you account for that? How do you account for people that seem to disagree with other contributors in the system? So, before I move on, this is what a task looks like. You're given a query, a description of what's relevant, what's moderately relevant, and then a number of results, and you say this result is relevant to the query or non-relevant -- that sort of relevant, non-relevant dichotomy is something that's common, albeit possibly problematic, in information retrieval. Another bit of terminology: what constitutes correctness? I'll talk a lot about accuracy, which is just the probability of a person's contribution being correct, so what does this mean? We had one data set with oracle classifications -- classifications of the same items submitted by reliable raters -- and we could compare the crowd's contributions to those reliable people. And then, at times, when we didn't have the oracle data, we also looked at majority rating. So what does the majority of people say? And this chart simply shows that there's a correlation between when a user agrees with an oracle and when a user agrees with other users. So, when we didn't have oracle data, using majority rating was a good proxy, albeit more conservative, so it erred on the side of rejecting.
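To make those two notions of correctness concrete, here is a minimal sketch of computing per-rater accuracy against an oracle and against a simple majority vote used as a proxy. The data layout, column names, and toy judgments are assumptions for illustration, not the actual data or code from the study.

```python
# Minimal sketch: per-rater accuracy measured two ways --
# against oracle labels where they exist, and against the majority vote
# over all raters as a (more conservative) proxy when they don't.
from collections import Counter, defaultdict

def majority_labels(judgments):
    """judgments: list of (item, rater, label) -> {item: majority label}"""
    votes = defaultdict(Counter)
    for item, _rater, label in judgments:
        votes[item][label] += 1
    return {item: c.most_common(1)[0][0] for item, c in votes.items()}

def rater_accuracy(judgments, reference):
    """Fraction of each rater's labels that match the reference labels."""
    hits, totals = Counter(), Counter()
    for item, rater, label in judgments:
        if item in reference:
            totals[rater] += 1
            hits[rater] += (label == reference[item])
    return {r: hits[r] / totals[r] for r in totals}

# Usage: compare the two reference sets for the same hypothetical raters.
judgments = [("d1", "a", 1), ("d1", "b", 1), ("d1", "c", 0),
             ("d2", "a", 0), ("d2", "b", 0), ("d2", "c", 0)]
oracle = {"d1": 1}                       # items with trusted labels
print(rater_accuracy(judgments, oracle))
print(rater_accuracy(judgments, majority_labels(judgments)))
```

Note that a rater's own vote is counted inside the majority here, which slightly flatters everyone; with only a few raters per item it can be worth excluding the rater being scored.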
And then I had two data sets -- we had two data sets. One was the one I previously mentioned, cultural heritage data, 23,000 relevance judgments, and it was very easy. It was essentially a binary task. There was relevant, non-relevant and then a third non-answer, called "I don't know." And then, for diversity, I also had a data set of Twitter sentiment ratings. I actually was talking to a computer science professor a few weeks ago, and he mentioned, we all have our embarrassing Twitter sentiment study in our past. That's like a rite of passage, and I just slunk into my chair, because I do, too. But yeah, so that data set had five possible answers, so it was easier to get it wrong. And then this I have mostly for the information scientists, just to show that the ratings distribution followed an inverse power law, which haunts us everywhere we go -- a sort of uninteresting distribution. But yes, so first question, does the amount of time a person spends on a question affect the quality of their answer? If it did, then what we would see is that people who spent more time on answering -- sorry, more time dwelling on a choice -- would be correct more often. Can anybody guess whether that ended up being true or false? Does spending more time on a task make you more likely to be right? >>: It's false? >> Peter Organisciak: As Jamie said, it's false. So the more time you spend on a task doesn't suggest how good you are at that task. Yes? >>: Were you looking at mean time, or were you looking at the entire distribution? What were you actually comparing it to? >> Peter Organisciak: I was looking at the amount of time that people spent between ratings. So people were given a query and a number of results, and I was looking at the amount of time it takes them from the last time they submitted something to the next time. So how long were they looking at that and thinking, "What's the proper answer," essentially. >>: The total amount of time? >> Peter Organisciak: It was the mean. Sorry. >>: For each person, how long they took on average compared to their (inaudible)? >>: You just have two groups, correct and incorrect, and then you look at the difference in means? >> Peter Organisciak: No, this was by task, so this was all the ratings, I believe, and the mean for this distribution is how much time the ratings that were correct took, and this is how much time it took for the incorrect ones, and there wasn't any difference. >>: So you're just doing a t-test between these two distributions. >> Peter Organisciak: Yes. >>: The distributions of the corrects and incorrects. Those look reasonable. >> Peter Organisciak: It was nonparametric, but they were pretty close. Nothing seemed to suggest that people that did things quicker were sloppier. >>: The p-value is reasonably low. >> Peter Organisciak: It is, and I'll get to that, because once I started pulling out some confounding variables, something interesting happened. Yes, so it was 0.15. >>: Did you partial out anything like individuals who just happened to be very fast, whether they agreed or disagreed, or something like domain knowledge? >> Peter Organisciak: Sorry? >>: There are two things that could be correlated with speed. One is whether you're a quick-twitch or a slow-twitch person, so you should see that within an individual there wouldn't be much difference, but maybe you have, when you did this, more slow-twitch people. >>: And then you have cheaters, too, who just want to get through really fast. >> Peter Organisciak: Exactly, yes. >>: And those do a lot of hits.
>>: But the other one is domain knowledge. You would have people that have zero variance in their responses. >>: Could be the difficulty of the question, too. >>: You just found the cheaters, and one thing that I and [Edjah Cameroon] found was, if you see a worker who's got a very low variance in the amount of time, then they're probably just pushing the button. >> Peter Organisciak: Exactly, yes. Domain knowledge is an interesting question, because both my data sets were very simple tasks -- the Twitter one was sort of perceptual: does this Tweet look positive or negative? With a more difficult task, where more domain knowledge is required, it would be interesting to see whether this changes. However, like I mentioned just a moment ago, we would give people batches of tasks. So it wouldn't just be one query and one result that you classify, because that slows things down, and they always have to learn a new query. So I started looking at whether the order affects dwell time -- so whether it matters that this is the first task in a set. Or, sorry, if you are doing the first task in a set, does the amount of time you spend on it change things? Or if this is the fourth task in a set, does the amount of time you spend on it change things? >>: Can I ask one more detail on the setup? >> Peter Organisciak: Yes. >>: Did you have the same query and then different results for an individual? Did they look at several different results for the same query? >> Peter Organisciak: Yes. >>: So you can certainly imagine that paging in a new query would take you some time. >> Peter Organisciak: So that was one of the reasons that we gave them a number of results, because there were multiple queries, and they were only shown 10 possible results for that query. Sometimes, they would return to a previous query. But, yes, so what I found was the amount of time doesn't change. Regardless of whether it's your third task in a set, fourth, fifth, sixth, it doesn't change, but the first and, to a lesser extent, the second task do take more time, which makes sense, right, because you have to remember, the first task is conflated with just the overhead of loading the task and figuring out what the question is, making sense of the query and what's relevant, what's non-relevant. So you'd expect the first task to be longer. However, what turned out to be the case was not only did people spend more time on the first task, but the people that spent more time on that specific first task were more likely to be correct. So that's why earlier there is that slight shift, even if it's not significant, because there's that effect only for the first rating of a set. So here's what I thought -- that this performance increase is related to people reading the instructions more carefully, so people that spent more time on that first task were more correct because they paid closer attention to the codebook. But, if that was the case, then that would also linger across all their other tasks. It's not just that I spent more time on the first rating and got the first rating correct. It's that I spent more time on the first rating and got everything more correct, or was more likely to be correct on everything else. And that ended up being the case, where here I split up the people that got the first task right and the people that got the first task wrong.
I don't know why I did this sort of simplistic thing, but essentially we had more A students amongst the people that got that first task right, looking at the other nine tasks that they did after that first one. That seems to suggest, I think, an interesting user-interface question, where you could quite quickly figure out how likely somebody is to be correct based on how good they are straight out of the gate. It also seems to suggest, at least to me, that bad raters aren't necessarily always malicious, but sometimes they just didn't read the instructions well enough, and as a result, they were worse throughout. >>: Can you distinguish between bad raters and people who were just careless during the first session with this probe? >> Peter Organisciak: Yes, so I don't think you can distinguish. If anybody has any ideas for how you would study the difference between people that are trying to cheat you versus people that are careless. >>: The "I don't know" option, how did you treat that? >> Peter Organisciak: I treated it as a non-answer, so if somebody said "I don't know," I just had somebody else rate that same thing. >>: Maybe you could look at some tasks that are really easy versus tasks that are hard and use them with cheaters who aren't reading the instructions, in either case. >> Peter Organisciak: Yes, so I thought that was interesting, and that's why I'm sharing it, because I liked it. So the next question I asked was about experience. Do you grow more reliable over time? Specifically, I looked at lifetime experience, the number of classifications you made for me overall, and also query experience, so the number of classifications you made specific to an individual query. So in that previous screenshot, the query was "plane," so how many times have you classified things as relevant or not relevant to a plane? And I didn't see an effect there -- somehow my dots got replaced, with a dot missing in each here -- but essentially, more or less, people on their tenth rating were just as good as people on their 200th rating. If you looked at query experience, it told a slightly different story. However, when I was reviewing this yesterday, I realized I really should have put the standard deviations here, because I think this overemphasizes the effect. As you got to that further end, there was a lower n. There were fewer people that did 50 or 40, but still, it went up slightly. Yes? >>: Did you look at time for each user as you go and see each query? So what I'm thinking of is, in the 2011 TREC, there was this study where they took time, and what they saw is essentially for each user, a user gets faster and faster as they get more experience, but at every query, there's a peak, so it looks kind of like a saw tooth. There's a peak for the new query. It starts to go down, then there's a peak, and it starts to go down. It's a very slow drift down in one direction, but you get faster per query, and then you peak every time you get an essentially new task. >> Peter Organisciak: That's a good question. I have the data to look at that, but I didn't actually look at it. I like that. Yes. >>: Yes, so this is a question from Rajesh Patel. How do you look at the threshold, the lower input threshold of the dwell times on tasks for people who were correct versus people who were incorrect? >> Peter Organisciak: Sorry, did I look at?
>>: So for people who got better results, did you look at is there a particular threshold of time that is spent, the least amount of time and the most amount of time that was spent was? >> Peter Organisciak: I did that as I was playing with the data, and I don't recall for sure, but I didn't see too much of an effect. So people, even when they're really quick, seem to be pretty good at it, but one of the confounding factors might be how long the result was. So that might be worth exploring. I don't explore it in detail, remembering that the data set I was looking for specifically varied between really short, like tweet-length entries and then really long, multiple pages out of a yearbook entries with a sort of balance between those two extremes. Yes, so last question I looked at was whether your agreement or disagreement reflects your overall quality, and I tried two things. First, I tried removing raters that disagreed with a lot of people to see how that affects the quality of the data set as a whole. And then I tried weighing raters, so people, contributors, weren't completely removed, but the strength of their contribution was weighed upwards or downwards based on how consistently reliable they were. So for replacing raters, first of all, for replacing raters, what we did was we assigned raters a maximum likelihood user reliability score, which is just the probability of you getting the correct label. If you were below a certain threshold, I took you out, and that threshold was 0.67, which sounds like a nice not round number. It wasn't fully arbitrary. What I did was I created a number of bots that went through the data and randomly labeled items, and those bots on average had a reliability score of 0.67, so if anybody was below that, then they were worse than a random bot, so I took them out. And what happened was, sure, my kappa went up, but there was very little increase in overall accuracy, which, when you think about it, is sort of sensible. I was just taking out people that disagreed with lots of other people, but it goes to show that kappa scores, they're a good indicator for their original purpose of measuring contention, but they're somewhat misused as an indicator of reliability of a data set. And that's not to mention the fact that the kappa assumption wasn't intended to be applied to hundreds of raters, to begin with. Okay, so removing people didn't improve the reliability of the -- sorry, the quality of the overall data set. What about weighing users? So what I did with weighing users is I similarly calculated a reliability score for each user. I set it as the confidence in all of that user's ratings -- so if Adam, for example, had a reliability of 0.7, all his ratings were valued at 0.7 of a contribution. And then I went and I recalculated all the user scores after doing that. It's EM like in that I had this cost function with the user scores and, over time, what would happen -- you can imagine, what if I'm a good rater, but I just get unlucky, and I'm rating the same thing that two other really bad people are rating, and I disagree with those really bad people, but that's just because they're bad. They're cheaters or something. I don't want to be punished for that. So by iterating what you would expect and what you see to a certain extent is over time, those bad raters, their contribution grows lower. If I'm a good rater, my contribution grows higher, so I'm punished less, and it pulls apart the effect of who you're rating alongside. 
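To make the iteration he just described concrete, here is a minimal sketch of that kind of EM-like loop: alternate between estimating item labels with reliability-weighted votes and re-estimating each rater's reliability from agreement with those labels. This is a generic illustration under assumed data structures, not the exact scoring function used in the study; the random-bot baseline (0.67 in the talk's setup) would then serve as a cutoff for removing or heavily down-weighting raters.

```python
# Minimal sketch of an EM-like reweighting loop for crowd labels.
# judgments: list of (item, rater, label). Illustrative only -- the weights,
# iteration count, and consensus rule are assumed, not the speaker's method.
from collections import defaultdict

def weighted_consensus(judgments, n_iter=20):
    raters = {r for _, r, _ in judgments}
    weight = {r: 1.0 for r in raters}              # start by trusting everyone equally
    labels = {}
    for _ in range(n_iter):
        # E-step-like: pick each item's label by reliability-weighted vote
        scores = defaultdict(lambda: defaultdict(float))
        for item, rater, label in judgments:
            scores[item][label] += weight[rater]
        labels = {item: max(s, key=s.get) for item, s in scores.items()}
        # M-step-like: a rater's new weight is agreement with the current consensus
        hits, totals = defaultdict(float), defaultdict(int)
        for item, rater, label in judgments:
            totals[rater] += 1
            hits[rater] += (label == labels[item])
        weight = {r: hits[r] / totals[r] for r in raters}
    return labels, weight

if __name__ == "__main__":
    toy = [("d1", "a", 1), ("d1", "b", 1), ("d1", "c", 0),
           ("d2", "a", 0), ("d2", "b", 0), ("d2", "c", 1)]
    print(weighted_consensus(toy))
    # A rater whose final weight falls below a random-guessing baseline
    # (0.67 in the talk, estimated with simulated random bots) could be
    # removed or sharply down-weighted.
```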
So I calculated this in two main ways. I won't talk about that third one, but I compared majority rating, which is just taking the answer that the most people say and then majority rating using the user weights, and then I tried different ways of assigning those user scores. The one I showed on a previous slide was the simplest one. This doesn't matter. I found that the lower bound was pretty high to begin with, so even when I was just taking majority rating, without doing anything fancy, when I was just saying three people, or these three people, look at this and classify it and the thing that two or three of you agree on, that's what I'll take as the answer. That ended up having a high accuracy rating for the overall data set. So the iterative algorithm gave me some improvements, but nothing significant. Hold on. Yes, so why is that? One reason that I think it is, is because my data was really clean to begin with. TREC last year had -- two years ago, for the first time, had a crowdsourcing TREC with essentially the same task, how do you identify bad raters and account for them? But their majority rating, so that baseline of the most number of votes, was around 70% in accuracy, while our data was 86% accurate out of the door. There wasn't much to improve on, and at the same time, it meant I didn't need to be so fancy with what I was doing. That suggested that -- sorry. My secondary data set, the Twitter data, that data improved more so because it was a lot more complex than just binary ratings, is this relevant or non-relevant? Yes. >>: The 86% to 70% number, are you comparing them over the same data set? >> Peter Organisciak: No. A different data set, so this was just using the TREC crowdsourcing data set. There, if you do something very simple like going by majority rating, you don't get the same quality as our data set. So there was something about our data set and how we collected it or maybe the queries. There was something about our data set that made people more likely to be right. Just by going by majority rating was good enough. And that seems to emphasize the point that, oftentimes, people are trying to be correct, because mathematically, if most people are making good faith effort to answer your question, then the people that aren't, they'll just get filtered out, even with only three people rating the same thing. And then, of course, we found that raters that get the first rating correct are overall better raters. This may be due to avoidable sloppiness, and I'd love HCI people to think more about that problem. How do you convince people to think more about the codebook, and I suspect there are a number of ways. One is maybe some qualifying tasks that a person has to take before jumping into everything. >>: You're not giving them any feedback, here, while they're doing it, right? >> Peter Organisciak: I'm not giving them feedback, but that's the other thing, like we spoke about yesterday. I'd love to see how people react if you show them how, according to our calculations, how well they're performing. Do they make better effort to backtrack if they're doing something wrong? Will they go and look at the results again? I found out raters grow comfortable with a task quite quickly. After two tasks, they're already doing everything at the speed that they'll do all their other contributions at. Then I found that EM-like algorithms, they're better for complex tasks, but if you have something simple like binary classifications, the improvement is small. So, thank you. 
That's the gist of my talk. I have a couple other things if anybody would like to talk about them, but thank you for coming. Yes. >>: Do you have any thoughts for what would happen in a setting where you didn't have such clean data? For instance, if you had a couple of judges who were just out to get you? Maybe they had a majority of the judgments? >> Peter Organisciak: Yes, so that's one of the things that protects crowdsourcing. I taught a course on crowdsourcing at a newspaper publisher a few years ago, and people would say -- people would ask me, why would you let anybody contribute? What if they're malicious? They could take down your whole site, or they could write things that you'll be liable for. But there's strength in numbers, and oftentimes, that's what works out. Separate from paid crowdsourcing, if you have something like online commenting or whatever, you can also crowdsource the flagging feature, where a lot of people that are honest participants in the community will flag something that's dishonest. But your question hits on a point where these sorts of approaches do break down. If there is a coordinated effort to be malicious -- 4chan spent many years trying to push at that boundary -- then you do break things down. >>: I've also run experiments where literally one or two people did like 25% of the work, and those guys were cheaters. >> Peter Organisciak: That's interesting. >>: But it may have been the nature of the task. It was a very subjective test. It was kind of like, "Do you like this or not?" So maybe they had a feeling like, "Hey, I can do whatever I want." I don't know. >> Peter Organisciak: That's a good point. So most of the time you can hide in numbers. If you can't, that's when you come into problems. 4chan actually, they tried to rig -- actually, a few years ago, they rigged the Time Person of the Year People's Choice voting, because Time didn't have any ways to stop people from repeat voting and no obstacles, so they made 10 million votes, all for the creator of 4chan, and of course he got to the top of the rankings. But then what Time did was they implemented reCAPTCHA, which was another project by Luis von Ahn, and what happened was 4chan couldn't automate a way to get past reCAPTCHA, so they tried to overwhelm it by just putting the same words in over and over, but they couldn't. So what happened was their gaming slowed to a crawl because they had to make each vote by hand, and it went from 10 million to 10,000 after the Captcha was implemented. Plus, they digitized some old books, although I think half the digitization was just lewd words, which weren't correct. Any other questions? I had this other sort of fun project that I did a few years ago. I was trying to explain to a class the idea -- one of the things that you really need to be considerate of with crowdsourcing is the ethics of it. So with volunteer crowdsourcing, if you have volunteers contributing, they can smell insincerity much of the time. They know if you're just trying to use them. If you're paying people, it's harder, because you can connect to markets in developing economies where you're not paying people a good wage, but it's the best they can get. So for this class, I tried to show an example of Mechanical Turk while making people stay aware of the ethical implications. So I had Mechanical Turk rewrite Jonathan Swift's A Modest Proposal. Is everyone familiar with A Modest Proposal?
It was an essay by Jonathan Swift where he was making fun of the sort of cold social engineering rhetoric of the time. So lots of people were saying, "Here are the solutions to Ireland's woes. This is how we do it." And oftentimes, those solutions didn't actually think about the people involved, so Jonathan Swift made fun of it with this essay where, in a sort of straight-faced manner, he suggested that poor Irish families should sell their children for food to the English. Since that sort of satire was unknown at the time, it was shocking to people. Today, we know it as satire, so it's not shocking to us. At the start of the class, I pushed a button and the system split the essay up into a number of sentences. Then it gave the sentences to Mechanical Turk workers, and they were asked to rewrite them in modern English, so just in colloquial -- in a more casual language. But what's more modern than Twitter? So I had them rewrite it in 140 characters, which was partially to keep them from just copying and pasting the same thing. So three people would rewrite every sentence, then three other workers would vote on the best sentence, and that would be chosen as the best version. Once again, this language, which to us isn't as horrific as it was at the time, suddenly became a lot more to the point, and, of course, there were people behind the scenes rewriting this. I wonder what they thought. And, of course, there was a Twitter account until -- I don't know, I didn't like it. So that's just a fun little project, and that's it. Thank you. Are there any other questions or thoughts? Discussions? Any slides people would want to see again? >>: I didn't know about Flickr Commons. >> Peter Organisciak: Flickr Commons is great. Flickr Commons actually reminds me of another project, in Australia, that the National Library of Australia did, where they've been digitizing old newspapers. And the problem with Australia -- or with Australian newspapers -- is they often got printing presses from England, like broken-down, old printing presses. So the quality of the old newspapers isn't good, so they scanned them, put them up with the OCR text, but the OCR text is terrible. So they put in the functionality where you could hit edit and correct it. I think that's a very strong model: as a consumer, you're reading these newspapers and you see, oh, there are all these typos in the OCR. Oh, there's an edit button. Well, all of a sudden, I can switch from being a reader, from being a consumer, to being a contributor, to fix something. That's part of your experience. You're sitting there in front of the computer and you see something that can be fixed. >>: And you're around tomorrow, right? >> Peter Organisciak: Yes, if anybody would like to talk to me further or about any other projects. Great. Thank you.