>> Desney Tan: Sue and I have the pleasure of introducing Michael Bernstein, coming out from MIT, working with Rob Miller. Michael really needs very little introduction. He's been out here a couple of times for internships working with four or five different groups. He's working now with MSR New England. So he's got plenty of experience with us. Michael's amazingly decorated, has multiple best paper awards, and holds one of our fellowships, the MSR Graduate Fellowship, and he'll tell us about his work over the last couple of years, combining human and machine intelligence. So Michael. >> Michael Bernstein: Thanks. Great. So I'm excited today to talk to you about crowd-powered systems. And crowd-powered systems are interactive systems that are going to combine human intelligence from crowds of people collaborating online with machine intelligence. And to set the stage for why this might be a good idea I'll start with the word processor. That's because the word processor might be the most heavily designed, heavily used interactive system ever. And like most interactive systems, it tries to support a very complex cognitive process, writing, and it does so by helping with some really complex manipulation tasks. You can think of how by now we've got some relatively efficient algorithms to help with layout, and we can build language models to help with spelling and grammar. But at some level what we have really little support for is the core act of writing itself, or maybe even editing. Think of questions like expressivity or word choice. Or even situations, well, like this. So how often have we all been in a situation like this, where you have an hour or two before a deadline, a strict page limit, and you're a little bit overlength? I think we've collectively burned more cycles fixing this situation than we'd like to admit. 
Now, historically, when you're in this kind of situation, you would turn to other humans, in particular, editors. If you were a published author you could turn to an editor who would help you shorten your text, who would correct things that Microsoft Word didn't catch, for example, spelling or grammar errors. And in a sense they were really a core part of the writer's toolbox. But this has really never been a part of the toolbox that we could make a permanent part of our software, because if you wanted to do that, you would have needed these editors to be available at large scale, really at any time. And that just hasn't been possible. But today we do have tons of people online taking on some really impressive tasks. This is known as crowd sourcing, and even within areas related to computer science, we're using crowds to help collect data for machine learning algorithms. We're running studies on our systems. Even social scientists and behavioral economists are running large scale studies using crowd sourcing platforms. We're folding proteins. We're even collectively writing an encyclopedia. And this isn't a new phenomenon. In fact it goes back to the 1700s, when the British Astronomer Royal started distributing spreadsheets for the calculation of nautical sea charts through the mail. It reached its height in the 1930s, when a WPA project hired 450 so-called human computers, which is actually the source of the term "computer" that we use today. But what I'd like to point out is that this lineage of distributed human computation has acted as a batch platform. That is, you take a lot of your work, you push it over the wall. You wait a while, hours, days, eventually bring it back and run some analysis on it. What I'm going to talk about today are ways in which we can turn crowd sourcing from a batch platform into one that supports interactive systems. 
That is, rather than having a single human editor help you out with a situation like this, where we're stuck between what the user's willing to do and what the system can support, what if we had tens or thousands of individuals all look at your document and suggest ways that it could be improved or shortened? We could algorithmically start to identify the best of their suggestions and give you access to them in an interactive system. That's what I mean when I talk about a crowd-powered system: an interactive system, a user interface, that combines machine intelligence, through whatever we can design or AI, with crowd intelligence. Now, let's say that we thought this is a good idea. You're going to run into a couple of challenges when you try to build these kinds of systems. The first one is quality. A few weeks ago I asked a thousand people online to flip a coin and to type H if they got heads and T for tails. Hopefully in this room it would be about 50/50. Turns out on the Internet there's about a two-to-one ratio of heads to tails. And this is not exactly that the Internet's a biased coin. It's that people are trying to optimize for money. They start satisficing. They try to generate randomness, and when they do so they do so in non-random ways. What's more, fully 7 percent of the respondents didn't type H or T. They actually went outside of the grammar, if you will, and typed out the entire word, misspelled it, or typed an enigmatic F. And these are the kinds of interesting challenges that we need to face when we start talking about integrating crowd contributions, voluntary or otherwise, into software systems. Because algorithms may not expect this. A second challenge is going to be speed, or latency. If we're building interactive systems we expect them to react quite quickly. And crowd sourcing just does not react that quickly. 
In fact, when it first came out people were very excited about it, saying that it's extremely fast, and then pointed out it was 48 hours before they got a response. In fact, some folks at U.C. Berkeley ran a survival analysis model and found that the half life for responses in these kinds of systems varies between 12 hours and two days, roughly, depending on how much you're offering in a paid crowd sourcing context. We really need to cut this down by orders of magnitude if we want to have interactive systems. So today I'm going to show that we can in fact create these crowd-powered systems, that we can overcome these challenges with quality and with latency, and embed crowd intelligence into our everyday interactions. And in order to do that, I'm going to introduce several computationally motivated techniques that help crowds accomplish tasks that they wouldn't otherwise be able to accomplish. Again, overcoming these challenges of quality and of latency. Now, I'm going to focus for most of the talk on paid crowd sourcing. I'll come back near the end to talk about how we can use other kinds of crowds to take on lots of tasks that paid crowd sourcing would never be able to do. But for the moment, you may have heard of Amazon Mechanical Turk, perhaps the most popular paid crowd sourcing platform. On Mechanical Turk, there are millions of tasks that are done, things that look like this: label an image, transcribe a short audio clip, for small amounts, usually on the order of a few cents. People do a large number of these and hopefully make a reasonable amount of money. If you look at the population on systems like this, it's about 40 percent in the U.S., 40 percent in India, 20 percent elsewhere. Across a variety of indices, gender, education, income, it mirrors the overall population distribution. That suggests that you really have some relatively educated individuals on these platforms who are looking to supplement or completely replace their existing income. 
We're going to use paid crowd sourcing to explore this notion of crowd-powered systems. I'm orienting my talk around two main systems. The first is Soylent, a word processor with a crowd inside, which will hopefully convince you that this entire idea is worth pursuing. And the second is Adrenaline, which takes these concepts and makes them happen in real time, quite quickly. I'll start with Soylent. Soylent is people. It's a word processor that's recruiting crowds as core elements of the user interface. Now, I'd like to point out before I start that I'm actually not the first person to come up with this name. You may know one Daniel Fisher, who handed the name off to me at some point. What I hope you take away from this section is that we're really embedding crowd contributions as a core part of how this system works, and that we're going to take something that crowds aren't potentially very good at and decompose it in such a way that we can actually focus individuals' efforts and get higher quality responses. So rather than tell you about Soylent I'm actually going to show it to you. So this is Soylent running on the Soylent paper, which is a little meta. But this is exactly the situation we pointed out earlier, where we're a little bit overlength. Let's decide this conclusion is a little bit too long. Rather than shortening it myself, I'm going to ask for help. I'll push it off to what Soylent calls Shorten. When I push this, Soylent's going to push a bunch of tasks off to Mechanical Turk. You don't need to pay too much attention to the details here, but you can see that workers are marking up the text, making some edits, doing some votes. When it all comes back, we've collected all of these suggestions that the workers made and we can start to put them all together. And what you see on the left is our original text, and on the right everything that has been marked up. 
Anything that's underlined in purple here is a section of the text that has been marked up as being shortenable, a patch we call it. You can see every patch has a number of different options that have been suggested as potential rewrites that are shorter. We can then consider the space of all possible paragraphs, order them by length, and give you a slider such that when you drag the slider the text rewrites itself and becomes shorter, longer, or really anywhere in between. When you're done, your text is now on ten pages. This shows you a new kind of interaction we can build by talking to crowd intelligence. But we can also talk about how we might support existing AI systems. Crowdproof is effectively a crowd sourced copy editor. It's going to find errors that Microsoft Word didn't, indicate solutions, and give you plain English explanations of what the problem is. You can see here that the crowd has suggested that this paragraph has two potential problems with it. You can see that they've explained that in one case a sentence is too long, and in another case here there's actually a parallel sentence structure error. So "introducing" is the correct way of putting it. The interesting thing about this second one is that this error got past, I think, eight authors and six reviewers before Crowdproof caught it, before the camera ready deadline. The reason is that this is the bottom of page 5. By the time we're reading and getting to the bottom of page 5, our eyes are getting a little bit tired. But crowd members are coming in with many different perspectives. And they're perhaps even seeing this text as the first thing they see. So there's a lot of different -- there's a lot of different reasons here why we can -- that was not crowd-powered. Why we might actually want to draw on the crowds here. Now, we can also talk about how crowds might support natural kinds of input to a system. And in particular, I tend to write like this. 
I leave my citations in brackets like this, and I need to come back later and fill them in and write out a bibliography. In particular, if I'm using something that takes in BibTeX as input, I need to go find that metadata. Let's say I wanted to get help with that. We can push out something with the Human Macro. Not everything can be proofreading or shortening, and this allows you to make open ended requests. In this case I might ask for help finding BibTeX for the citations in brackets. Rather than writing this myself, I'm actually going to show you what one of our user study participants created. You can tell it's a little bit unclear. "You can locate these by Google Scholar searches and clicking on BibTeX." Not the clearest thing in the world, but we'll go ahead and paste it in. We can say how many people we want to pay and how much we want to pay them. And when it comes back, you'll see something that looks like this, where they've gone out to Google Scholar, figured out what we meant, and brought back the BibTeX. That in short is Soylent. Soylent's goal is to reach out to these crowd contributions to create new kinds of interactive systems. Interactive shortening, a new kind of interaction. Aid with proofreading, supporting an AI system. And open ended requests like the Human Macro: find me a figure, and because it's the Internet they'll find you cats. These are the kinds of things that we think are possible when you engage with crowds as a core part of interaction. Now, if you were to try and build a system like Soylent, you might come up against some interesting challenges, as I pointed out earlier. In particular, we've worked with a lot of Mechanical Turk workers on these kinds of systems and we simply see that roughly a third of what we get back, or 30 percent, is just not something you'd want to show to a user, particularly not a user who might be paying for such a system. 
And we need to actually deal with this quality issue if we want to make such a system really deployable at large scale. Why is this happening? I'll explain it through a couple of personas. I took this paragraph off of a high school essay website. It's a really horrible paragraph. I've underlined a few of the issues here. We asked Mechanical Turk workers to, in an open ended sense, edit it, make it better, proofread it. You see two kinds of personas here. One we would call the lazy worker. This is someone who is trying to optimize for money and send a clear signal that they've done the work, but do no more than they really need to. So given this really error-filled paragraph, a lazy worker is going to do something like this. That is, they made a one character change to the word "comradeship" to fix the spelling. It's not surprising they did that, because it was the only word that was underlined in their browser as being misspelled. But they made a clear edit. On the other end of the spectrum there's the eager beaver, someone trying to give a signal that they've done the work, but they sort of go too far. They go outside the bounds of what the system might expect. Given the same paragraph, an eager beaver will make some good fixes. But then they're also going to insert new lines between each sentence, which I personally would not consider an improvement to the text. And these personas are not specific to Mechanical Turk. You can think about, say, Wikipedia, where you have some workers working quite hard to make edits but getting reverted because they don't know the rules. You have other people who are just getting by. And in my opinion the state of programming with crowds in the loop is still in its very early days. In my view it's sort of similar to before we had patterns like model view controller and so on, which started to codify best practices such that you could get reliably better results. 
So our goal here is going to be to start thinking about what such design patterns might be in the crowd computing space. I'll introduce one that we use in Soylent called Find-Fix-Verify. The notion is that this design pattern is oriented towards these open ended problems, like the things that Soylent tackles. I can contrast it to a close ended problem, something like a multiple choice test, where you can upload some small percentage of ground truth and use that to compare and figure out whether the work is good. Here, there's a huge space of possibly correct answers. What we're going to do is decompose this very open ended problem into three stages which are slightly less open ended and give workers more direction. I'll explain it through an example. We used this both with proofreading and shortening. I'll show you with shortening here. Rather than having workers directly edit the text, we're first going to have them just find areas of the text that can be edited. We're going to effectively get a heat map over, say, the paragraph. We're going to look for independent agreement across workers to certify an area of the text as a patch. And we're going to send each patch out in parallel to a Fix stage. In the Fix stage we're going to show another set of workers exactly one of these issues and ask them to fix it, that is, shorten the text or fix the typo, depending on the text. We're going to collect a bunch of these suggestions, randomize their order, and put them through a Verify stage. The Verify stage is going to try to enforce some invariants here. Basically we want to make sure we're not changing the intended meaning of the text and not introducing any new style or grammar errors. Anything that survives the verification stage we can finally pass back to the application logic, and in particular here create something like Shorten. Okay. So why is this a good idea? In particular, why would we split Find from Fix? 
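The three stages described above can be sketched roughly as follows. This is a minimal illustration in Python, not Soylent's actual implementation: the function names, the two-worker agreement threshold, and the majority-rejection rule are all assumptions made for the example, and canned worker responses stand in for real Mechanical Turk tasks.

```python
from collections import Counter

def find_stage(marks_per_worker, min_agreement=2):
    """Keep spans that at least `min_agreement` workers marked independently.
    The threshold of 2 is an assumption for this sketch."""
    counts = Counter(span for marks in marks_per_worker for span in set(marks))
    return sorted(span for span, n in counts.items() if n >= min_agreement)

def fix_stage(patch, rewrites_per_worker):
    """Collect the distinct rewrites proposed for one patch (first-seen order)."""
    return list(dict.fromkeys(r for r in rewrites_per_worker if r != patch))

def verify_stage(candidates, reject_votes, n_voters):
    """Drop any rewrite that at least half of the verifiers flagged (assumed rule).
    A real task would also shuffle the candidates before showing them to voters."""
    return [c for c in candidates if 2 * reject_votes.get(c, 0) < n_voters]

def find_fix_verify(marks_per_worker, fixes, reject_votes, n_voters):
    """Run the three stages and return {patch: surviving rewrites}."""
    result = {}
    for patch in find_stage(marks_per_worker):
        candidates = fix_stage(patch, fixes.get(patch, []))
        result[patch] = verify_stage(candidates, reject_votes, n_voters)
    return result

# Canned worker responses (hypothetical data).
marks = [["are going to have to"],
         ["are going to have to", "in this paper"],
         ["very"]]
fixes = {"are going to have to": ["have to", "must", "have to", "will be"]}
reject_votes = {"will be": 3, "must": 1}

options = find_fix_verify(marks, fixes, reject_votes, n_voters=4)
print(options)
```

Only the span that two workers marked independently becomes a patch, and the rewrite that most verifiers rejected is filtered out before anything reaches the user.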
Why not just let workers go in and sort of improve the text? Well, one major reason is that we're actually taking advantage of these two personas I introduced. In the Find stage we can force lazy workers to find two or three errors in the text, which gives them a lower bound that they can understand and we can understand. In the Fix stage we can point these lazy workers at a problem that is perhaps not the easiest one, and they can't sort of get away without having fixed that particular problem. So by giving them a more specific task, we actually end up producing higher quality results. At the same time, we can focus the eager workers on a task that we really want accomplished right now and hopefully keep them from going too far off the rails. It also allows us to group the suggestions, so we can get that drop-down. If you've ever passed out a draft and gotten a bunch of different edits that are confusing, what we know now is that, given a particular problem, these three edits are all different ways to fix it. So you don't have to actually do that merging yourself. The notion behind the Verify stage is that we get higher quality output by putting these workers in productive tension with each other. So you have one set of workers whose goal is to try to suggest options and another set of workers whose goal is to consider critically whether those are correct. I'd like to point out that while this was somewhat early in the space, there's really a growing literature, much of which was done here in fact, that's playing into this larger space of what happens when we combine crowds and algorithms. >>: Why verify versus a more general rank that would show two possible solutions to someone and have them pick the better one? >> Michael Bernstein: It's actually agnostic. With Crowdproof we just try to pick the n best. Like the one best, because we want to make one edit. 
With Shorten, we actually want to filter anything bad and sort of get a general rank, because we just want to continue -- we want a set of as many options as possible. But you can imagine doing a rank, and then you'd have to assume where some cut-off would be. But certainly there are many different ways you can run a verification stage. This is just one way. Does that address your question? >>: I think so. >> Michael Bernstein: Come back later if I'm still being confusing. We wanted to know whether this worked. In particular, we wanted to know three things. How high is the quality, what does this look like? How long do you have to wait? And how much is it going to cost? So we can throw a bunch of input texts at Soylent, in particular here at Shorten. Here are five different input texts we give it, ranging from TechCrunch, HCI papers, OS papers, and my personal favorite, a rambling e-mail from the Enron corpus. We feed them through Soylent and get edits that look something like this. Now across all these texts we see we cut about 15 percent of the original paragraph length on average. What this means is that you can take an 11-page draft of a paper, hold constant the title, the figures, all that boilerplate, run it through Shorten, and get a ten-page paper back, without having changed any of your core arguments. So why does this work? How does it work? Workers tend to avoid any sort of technical content they don't understand and instead focus on wordy phrases. In this case the phrase "are going to have to" can be changed to "have to" without changing the meaning of the text too much. These are exactly the set of phrases we can start to collect a corpus of, and train a machine learning system to take over much faster and for free, and just have the crowd take over a verification stage. Now, they make more complex edits as well, for example, merging sentences. This sentence now reads the larger tangible bits project, which introduced the meta desk into companion platforms. Okay. 
But this does not always work. Here are some interesting ways in which it fails. One is that workers are not members of your community of practice. That is, they're not experts. And they might mistake their expertise. There's a signalling phrase in academia, we say "in this paper, we argue that". Workers just find it boring. So you may disagree legitimately. So expertise is one issue. Another one, which is endemic, has to do with parallelism, that is, workers in one patch can't see what the workers in the other patch are doing. In this case you have two list items, and they cut the main phrase from one and the parenthetical from the other, which leaves the resulting sentence somewhat meaningless. If you wanted to fix this you would need to talk about enforcing global constraints, which is what Chi and Erica have been looking at, or merging patches together when they get too close together. So across these three stages we're recruiting hundreds of people for each of these texts. It costs about $1.25 a paragraph. If you're willing to wait longer it can get down to about 30 cents, and from there you can start talking about optimizing, sort of in decision theoretic terms, trying to minimize the number of workers you would need for each stage subject to some global quality constraint. Now, how long do you have to wait? There are two types of wait time in Soylent. The first one is between when Soylent asks for help and when a worker says, okay, I'm going to help you out. Now, if you sum the median Find, median Fix and median Verify, you see this takes about 18 and a half minutes. It can take much longer. It can stall, but roughly you're looking at about 20 minutes. The second wait time is between when the worker says they'll help and when they actually complete the task. If you sum the medians, this is actually much faster, it's about two minutes. 
Now, in the second half of the talk I'm going to show ways in which we can get that 18 and a half minutes down by several orders of magnitude. But you're looking at perhaps, in the limit, about a two-minute wait between when you ask for help and when Soylent can get back to you. So we can do the same thing with Crowdproof. We can give it lots of different input texts: Wikipedia pages, text which passes Word's grammar checker, and the one at the top, which is an essay written by a nonnative English speaker. I'll focus on that one. You can see some of the edits it makes. Word by itself finds about one-third of these errors. Crowdproof finds about two-thirds of them. Interestingly, they find different errors, which is to say when you combine them you get about 82 percent coverage. When it finds an error, Crowdproof fixes it about 90 percent of the time. So when does it miss? Most commonly when there are two errors in the same patch and the lazy workers come in and fix the obvious one but don't notice the more subtle one. So the same process is happening over and over. Human Macros, same thing. You can see exactly the text that I pasted in earlier, and other things like finding figures or changing the tense of a document. I'll focus again on the first one. The input text looks something like this. In fact, if you're familiar with the literature, you know that this is an incorrect citation: Duncan Watts is one person, not two. But the worker still managed to figure out a noisy input and command and get the correct answer. Now, there's no verification stage here, which means the 30 percent rule comes back. So we see about 70 percent of the time these things are perfect. And about 90 percent of the time they have the right idea but perhaps some subtle error in them. So far I've introduced you to Soylent, which has introduced this new space of interactive systems that are powered by crowd contributions. 
And in order to do this, I've introduced the Find-Fix-Verify design pattern to you, which has started to focus these contributions to address questions of quality. >>: Can you go into what percentage of the workers end up actually contributing past the verification? >> Michael Bernstein: What percentage? >>: If you have a lot of lazy workers giving you totally useless things. >> Michael Bernstein: What you're asking is, is there a high correlation: if you give me bad things now, will you give me bad things later? >>: Yes. >> Michael Bernstein: My sense is yes. I don't have numbers for you there. Certainly there's a power-law contribution in most of these things, such that a small number of workers are actually producing a large amount of your results. There are lots of folks who think about sort of this global quality management. CrowdFlower is a good example of someone who maintains it sort of across many tasks. And I think the first thing you'd want to do is start to build up a better notion of reputation. Either the platform, like Mechanical Turk or oDesk, can do this better than any individual requester, or as a requester we can get feedback from the user saying "I like that edit" or "that edit was really bad", and we can sort of propagate it backward. >>: What's their deal, and were they -- >> Michael Bernstein: So often -- [laughter] -- I don't have deep knowledge. That is, I think you'd want to spend more time talking to them to get a better sense, but there's this notion that you're overcompensating. Sometimes it has to do with whether you have an interesting task, or if it's the first one, you don't know the task parameters yet or the boundary conditions. I think that's one thing that's going on. I think often it has to do with them overestimating their abilities as well. Like, yes, I'll just insert new lines. Not a good idea. >>: You mentioned the payment, by the way. Did it affect the quality of the work? >> Michael Bernstein: No. 
In fact, that Winter Mason and Duncan Watts paper demonstrated that paying more gets you more work, that is, faster, but no known increase in quality. In general, that's what we see. You need to design your tasks better to get higher quality results, which I think as an HCI person is nice to hear. That means I can actually have an impact. At this point I'll turn to Adrenaline, which is going to take these ideas and push them into the real time space. The reason we want to do this is that the kinds of applications we can build are really constrained in a very deep way by latency. Now, Soylent is one of the first crowd-powered systems, and I can actually turn to the broader research literature that has started to explore the space in a much broader way than even I could alone, showing that it's useful for design, health and nutrition, robotics, vision, many other kinds of things. But, fundamentally, all of these applications are constrained by the same limit as Soylent, which is that sort of 20-minute wait time. In fact, the best result we've seen in the literature comes out of Jeff Bigham's group at the University of Rochester, which was able to get one response from a worker about 60 seconds after you ask. And that response isn't verified, because it's a singleton. And if that's the best we can do, we've already lost, because usability psychology has demonstrated that users will only pay attention to an interaction for ten seconds at most; after that they lose the flow. What we really need to create are on demand, flash, real time crowds. That's our goal here. We're going to pick one motivating application, which is going to be Adrenaline, a camera for novices, sort of built into your cell phone. And what it's going to do, for the kinds of situations pushed out beyond what Mike's been working on, is try to sort of find aesthetically, subjectively, the right moment to take the photo. It's sort of the moment camera, but crowd-powered. 
We want to do this in real time, because ever since the introduction of the digital camera, it's become a core part of our photo-taking experience that you take the photo and you see the result. You can take another one, you can share it with your friends. We don't want to go back to an era where you have to develop your film overnight. So this is the kind of thing that Adrenaline looks like. You can see they're capturing a video of people doing high fives there. One of them in fact is me. As of right now we just made a request to the workers to help us choose the best frame. They're going to poke in along the bottom there, start exploring the space, and very quickly they'll focus in on the final frame. So just a few seconds later we have a final frame. Here are a few other kinds of pictures that Adrenaline takes. You can see people trying different angles, different kinds of poses, action shots like people jumping off a bench, or just people being silly and hoping that the crowd will pick a cute moment. And, again, we can collect data from these kinds of systems and start to train more automatic ones as we push out. So if we want to create Adrenaline, we need to solve two problems. And these correspond to the two wait times I pointed out before. The first is how we get the crowds there quickly. In order to do this I'm going to introduce a new recruitment approach for crowd sourcing that we call the retainer model. The idea behind the retainer model is that we're going to ask workers to come and sign up before we need their help. And we're going to actually pay them a little bit extra while they go do anything else. They can work on other tasks. They can check their e-mail, they can chat. But as soon as we have a task for them, they have implicitly agreed to come back. When we have a task we just pop up a simple JavaScript alert, which brings their attention to our browser tab, and we go from there. So does this bring people back quickly? It's an empirical question. 
In fact, in the space of HCI kinds of questions it's one of the most measurable. So we ran a study on Mechanical Turk, counterbalanced across days of the week and times of day. And what we did is we had people sign up for a task and then we called them back. So I'm going to draw a graph here. On the X axis you're going to see how long it took between when workers saw this alert and when they clicked the okay button and started working. On the Y axis we're going to see the CDF, what percentage of all the workers clicked the okay button at least that quickly. They're randomized into different wait time buckets. So if the workers weren't waiting very long, you saw a curve that looked something like this. If they're waiting a little bit longer, a curve that looks more like this. What you can take from this is that if the workers are waiting under ten minutes, you get about half of them back two seconds after you ask. And in fact you get about three-quarters of them back three seconds after you ask. Now, what happens if they've been assigned to wait longer? Now you see more attrition. But in a separate experiment we found that if you offer a small bonus you can take a curve that looks something like this, a 25 percent chance of this worker coming back, and push it all the way back as if the worker hadn't been waiting at all. We changed the incentives and we changed the behavior. I should note that we're paying half a cent a minute for the expected wait time. So it costs about 30 cents an hour to have someone on retainer. Now even just with this, we can create some real time kinds of applications. We built one called AB, which sort of crowd sources instant votes. If I want to know which tie to wear today or which of two designs people like better, over here on the right you'll see a go button. I'm going to click the go button and replay a result from one of our studies. So this is something that took 20 minutes with Soylent. 
60 seconds with Jeff Bigham's work, and we get five votes in about five seconds. So this is the kind of thing that the retainer model can do. Crowds in two seconds, and traditional crowd sourcing kinds of tasks in about five seconds. And that's great if what you're trying to do is choose between two photos. But Adrenaline is not. In particular it's trying to choose between say 100 or more photos all at once. So what happens now is the workers arrive quickly but it takes them a long time to actually shuttle between those last few frames and choose the best frame. So how are we going to help them find that decisive moment? How do we help the workers work together in order to overcome these slow work times? And we're going to take advantage of one notion here, which is that we really have created synchronous crowds. For the first time you can assume all these crowd workers are arriving at once, not arriving and leaving at individual whim as you usually have on Mechanical Turk. And we can start to think about how we can get them to work together. In particular, I'm going to claim that if we're smart about this we can get the crowd to work faster collaboratively than even the single fastest member of that crowd. The way we're going to do that is through a technique called rapid refinement. In a continuous search space like what we had with Adrenaline, the notion with rapid refinement is to look for agreement early as it's starting to emerge, before people would have made their final selection, and use that to reduce the search space quickly and focus everyone's attention. So I'll explain what we mean with pseudo code here. On the left you see what the server has. On the right there are three workers who get initialized to random positions in the video. The server sees all of them, and we're going to start looping until we get down to a single frame. We're going to look for agreement. We'll see how many workers are within a particular region of this video. 
Right now there are none. We'll wait until there's a certain amount of agreement, say two-thirds, as the workers start navigating through the space there's still no agreement. Eventually you'll see that two folks do come together indicating that they are interested in the same region rather than immediately jumping forward, we're going to make sure it's not a false positive. We'll make sure they stay in that region for two seconds. They do, we're going to certify this as a refinement. Reduce the search space so those folks who agreed can stay exactly where they were. Anyone who disagreed won't get paid yet, and will get reinitialized to a random new part of the video. And we're going to keep doing this again and again until we get down to a single frame. This is how rapid refinement works. I'm going to show you exactly the same video I showed you before. Just focused on the bottom part. So you're going to see that workers arrive and very quickly they're going to start agreeing on that central region, we're going to have refinement and an overlapping another refinement down to a single frame. So just a few seconds. Now, what came out of this? That is, is this work? Do we have some sort of quality time trade-off happening here? And when we have low quality results, what's going on? Sumit. >>: Seems like there's an underlying assumption there's kind of a single best place, because if you have multiple error functions ->>: So if there's bimodal distribution. >>: You have problems there. But is that generally the assumption that the video is taken away with one kind of optimal place. >> Michael Bernstein: We assume there's one intent. But the nice thing about crowds is that there's many of them. What you can do, although we don't do in the current implementation is fork. 
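The rapid refinement loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Adrenaline implementation: the function names, the window size (a quarter of the current range), the one-tick-per-second clock, and the `step_fn` worker model are all hypothetical. Only the two-thirds agreement threshold, the two-second hold, and the reinitialization of disagreeing workers come from the talk.

```python
import random
from fractions import Fraction

QUORUM = Fraction(2, 3)  # share of workers that must agree (from the talk)
HOLD_TICKS = 2           # stand-in for the two-second hold; one tick = one second (assumed)


def has_quorum(positions, window):
    """True if at least QUORUM of the workers sit inside [start, end)."""
    start, end = window
    inside = sum(start <= p < end for p in positions)
    return inside >= QUORUM * len(positions)


def find_agreement(positions, lo, hi):
    """Return the leftmost window in [lo, hi) holding a quorum, or None."""
    width = max(1, (hi - lo) // 4)  # assumed: window is a quarter of the range
    for start in range(lo, hi - width + 1):
        if has_quorum(positions, (start, start + width)):
            return start, start + width
    return None


def rapid_refinement(num_frames, step_fn, num_workers=3, seed=0, max_ticks=10_000):
    """Shrink the frame range until one frame is left; return its index.

    step_fn(position) -> new position models how a worker scrubs the video
    on each tick (hypothetical stand-in for real worker input).
    """
    rng = random.Random(seed)
    lo, hi = 0, num_frames
    # Workers start at random positions in the video.
    positions = [rng.randrange(lo, hi) for _ in range(num_workers)]
    candidate, held = None, 0
    for _ in range(max_ticks):
        if hi - lo <= 1:
            return lo
        # Each worker moves, clamped to the current search range.
        positions = [min(hi - 1, max(lo, step_fn(p))) for p in positions]
        if candidate is not None and has_quorum(positions, candidate):
            held += 1  # agreement persisted for another tick
        else:
            candidate = find_agreement(positions, lo, hi)
            held = 1 if candidate is not None else 0
        if candidate is not None and held >= HOLD_TICKS:
            # Certify the refinement: agreeing workers stay put, the rest
            # are reinitialized at random inside the reduced range (assumed).
            lo, hi = candidate
            positions = [p if lo <= p < hi else rng.randrange(lo, hi)
                         for p in positions]
            candidate, held = None, 0
    return None  # no convergence within max_ticks
```

For example, `rapid_refinement(100, lambda p: 42)` models workers who all jump straight to frame 42, and the range narrows onto that frame. With workers who scrub gradually, two of them can agree on a region that does not contain the best frame, which is the false-positive case discussed later in the talk.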
You can imagine this as populating some probability distribution. If you see two peaks you can put one set of people over here and one set of people over here, or focus on one and wheel around when you have more time to explore the second one. And I'll point out another reason in a minute why that might be a good idea. >>: One more quick question. What's the final dollar amount that you were paying for finding that frame, when you add -- >> Michael Bernstein: I'll show you in just a moment. So cost is going to be another element here. So we actually had 34 folks, I think, from our university come in and take video photos with this, and we produced five different candidate frames from each of these input videos. One of these frames was generated using rapid refinement as I described here. A second one was effectively ground truth: we had a professional photographer come in and choose that best moment. Third was an off-the-shelf production-level computer vision algorithm choosing aesthetic or representative frames within video. You can think of this as effectively what Google does when it chooses a frame for a video. The other two techniques were more crowd sourcing-oriented. Generate and vote looks a lot like find, fix, verify. We call people in off retainer. They nominate frames. We call more people off retainer, they vote amongst those frames. Generate one just takes the first response we get in generate and vote. That is, the fastest member of the crowd, as soon as they produce anything we just take it. So we can measure two things. One is -- sorry, three things. Cost, latency and quality. We'll start with quality. So we can have these people rate on a nine-point Likert scale how much they thought the photo was what they were looking for, that they like it. What we see is rapid refinement tends to do statistically better than computer vision, which chooses a different moment. And statistically it is indistinguishable from the photographer due to large variance. 
Now, typically you see something that looks like this, where the crowd chooses something in the same general area, but not exactly the same frame. On the top row they were actually just one frame apart. Sometimes you see something like this, though, where you notice this is a bad photo. The guy's eyes are closed. It's blurry. So what happened here? We actually had a false positive. You had two workers who were interested in nearby regions of the video that didn't overlap. But they were close enough to each other that the system thought they were interested in the intersection. Snap down, and they were left with a region of the video that had nothing good. So if you wanted to catch this you would need to notice thrashing behavior and pop back out and explore different area. Okay. So here I can answer your question about cost, rapid refinement was about 20 cents a photo. And it went up from there. So what I would hope you would take from here is that rapid refinement was actually not only the fastest, but it was the most reliably fast. It statistically had the least variance, which I would claim is very important for interactive systems. You don't want something that reacts quickly sometimes. You want it to be sort of reliably reacting quickly. Really what's happening is we're pulling up the tail. Sometimes you do have fast individuals. But sometimes you don't. Rapid refinement can identify that longer tail or takes that longer tail and pushes that probability mass to the left. Generate and vote still performs in under a minute, which is much faster than Soylent and, in fact, matches the quality of the photographer, which we thought was pretty cool. >>: I'm a little confused about the second row there. 
Because it seems like, I mean, generate one, if you were using the same retainer scheme as you are in the first one, seems like that should be pushed even further towards zero because it's the very first person that responds to anything, whereas in the upper case you're talking about some cascade. >> Michael Bernstein: It is pushed a little bit farther to zero. If you want to, like, take out your eyeglasses you can see that it's actually sort of one unit left. >>: Scan the whole thing. >> Michael Bernstein: They still have to scan the entire thing. It takes them some time to get called off retainer. And we're just taking the first one. So we have five people on retainer. We call them all back and we only pay attention to the first one. Sometimes you don't randomly have a fast person in your crowd. So that's really what's happening here. Okay. So we make a few trade-offs. One strength is we actually get fast preliminary results. So within that ten-second boundary we can return something to the users. On average, that first refinement happens within ten seconds. We also don't need a separate verification stage, because verification is effectively built into this algorithm. We're looking for agreement as we go. But we do sacrifice some things. We're sacrificing some amount of quality to get the speed trade-off. You can think of it like randomized algorithms, where you're not getting a potentially optimal result but something that's much faster. So you have this trade-off now. And more important, in my opinion, is the fact that we're actually stifling individual creativity in the system. And this is not just Adrenaline and rapid refinement. Find, fix, verify is the same thing. Most crowd sourcing systems have all this regression to the mean effectively happening. Imagine you were the photographer in the crowd. You have no special ability to actually pull the crowd toward what you know to be a good result. 
If you want to push forward in this, I think you want to start talking about automatically identifying these experts as we were talking about earlier and giving them a privileged position within these systems and these algorithms. Now, in terms of generalizability, rapid refinement we think applies to sort of single-dimensional search spaces. Just within photography you can think about brightness, contrast, color curves, these kinds of things. So by combining the retainer model and rapid refinement, we're able to execute these really large searches in a human perceptual space within about ten seconds. And this allows us to turn around and start asking the same kinds of questions about, say, creativity support kinds of applications. This is Photoshop. Let's say you were creating a poster for a rock concert, and you wanted to have a band of, or a crowd of, screaming individuals in the audience. This Puppet Warp tool allows you to author a control point sort of like [inaudible] work and you can drag it. Let's say we call people off of retainer and say make that person look excited. We can have a bunch of individuals do that manipulation. We can take all of their suggestions, draw them back into a layer in Photoshop and produce something that looks like this. So, in particular, with about eight workers on retainer, you start getting feedback in a couple of seconds. You get your first figure in a half minute. We went out to several hundred figures and kept getting new ones every three seconds on average. So we think we've really closed the loop here and connected this back to a productivity-style, creativity support desktop application that's allowing you to sort of draw on this crowd intelligence as you need for things that perhaps you would never think of. Now, I'll back off here for a moment and point out that the retainer model has started to systematize the recruitment process for crowd sourcing. 
We're changing that recruitment process, and by systematizing it we can actually begin to model it and ask what happens when we go from having, say, 20 people on retainer to huge numbers. I won't go into too much detail here, but it turns out you can cast the retainer model using queueing theory. That is, this is just a formal framework that allows you to understand, if tasks are arriving at some rate and you can recruit new workers at some other rate, questions like how long is the queue, the line. And in particular this is an M/M/c/c queue, which is to say we have c workers on retainer, and if we have any more than that number of requests we're just going to give them a busy signal. Now we can ask, what's the probability that when I need help, that is, there's a task, there's no one left on retainer to help me? We can derive this from Erlang's loss formula in queueing theory. This pi term here has a closed-form formula. And you can ask what's the expected number of workers on retainer, which gives you a better sense of cost. You can then plot those two things against each other and treat it as a minimization problem. You can say how few workers do I need on retainer to have some guarantee of service, like a one in 10,000 chance that when someone wants help, there's no one left to help them. This has lots of other applications. You can model what happens when you share retainer pools across applications. You can ask what happens when you then start routing tasks to workers to avoid starvation. Or you can do what we call predictive recruitment, or precruitment, which is the notion that if we know that on average a task is going to arrive within the next ten seconds and workers will maintain their attention for up to ten seconds, we can actually recall the worker before we have the task. Show them a loading screen for a moment. 
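The Erlang loss calculation mentioned above can be illustrated with a short Python sketch. The function names and the example load are hypothetical, and the mapping is one reading of the talk rather than the authors' exact model: each of the c retainer slots is treated as a server, a slot is "busy" while its replacement worker is being recruited, and the offered load is the task arrival rate multiplied by the mean recruitment time. Erlang's loss (Erlang B) formula itself is standard.

```python
def erlang_b(c, offered_load):
    """Erlang B blocking probability for an M/M/c/c queue.

    Uses the standard recurrence B(0) = 1,
    B(k) = a * B(k-1) / (k + a * B(k-1)), with a = offered_load.
    """
    b = 1.0
    for k in range(1, c + 1):
        b = offered_load * b / (k + offered_load * b)
    return b


def expected_idle_workers(c, offered_load):
    """Expected number of retainer slots holding a waiting worker.

    In an M/M/c/c queue the mean number of busy servers is a * (1 - B),
    so the mean number of filled (paid, waiting) slots is c minus that.
    """
    return c - offered_load * (1.0 - erlang_b(c, offered_load))


def min_retainer_size(offered_load, max_miss_prob, max_c=10_000):
    """Smallest c whose chance of an empty retainer is at most max_miss_prob."""
    for c in range(1, max_c + 1):
        if erlang_b(c, offered_load) <= max_miss_prob:
            return c
    raise ValueError("no retainer size up to max_c meets the target")


# Example with an assumed offered load of 2.0 and a 1-in-10,000 target.
size = min_retainer_size(2.0, 1e-4)
print(size, expected_idle_workers(size, 2.0))
```

Plotting `erlang_b` against `expected_idle_workers` over a range of c gives the service-versus-cost trade-off that the talk treats as a minimization problem.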
When we do that, we actually see that we can get feedback in just a half second. It really starts to blur this cognitive boundary between me pressing a button and seeing feedback as sort of cognitively part of that action. So you can push farther on this. But I'll back off here. I'll say at this point I hope I've convinced you we can create real time crowd-powered systems, and that we can introduce techniques in order to support these. Things like the retainer model and rapid refinement. So, yes, question. >>: So you created a model of human behavior by giving people these incentives to stick around and wait for your response, and so you're basically one economic entity in the system who has done this. And the question is, just as in any type of arbitrage system, what happens when everyone else starts running arbitrage? >>: Right, this is exactly ->>: Is it stay the same. >> Michael Bernstein: This is exactly why we want to start asking and modeling what happens when we combine retainers across requesters. What I want to put forward is that the platform could actually support this. You can imagine having two sets of tasks. There's the real time tasks and the non-real timer batch style tasks and you can sort of agree to follow like in the sense of Twitter like I like that request, that kind of task, that kind of task and just give them to me as they come. Otherwise I'll start picking up tasks and the system can actually consider the space of everything I signed up, the space of everything else have signed up for and the tasks coming in and route them to actually sort of keep a globally optimal solution. So it's definitely possible right now on Mechanical Turk to do that kind of arbitrage, right? I'm trying to push forward and say how would you design the next platform to avoid that kind of problem. But your concern is absolutely valid given where we are now. >>: So you're assuming a monopolist buyer. >> Michael Bernstein: I'm assuming what. 
>>: A monopolist buyer, that there's one system that makes the rules and hands all the requests out to the workers? >> Michael Bernstein: I wouldn't call it a monopolist buyer but you can imagine it that way. Platform support, or you could be like CrowdFlower, a middleman where I'll help you get real-time workers, and you just sign up through me and I'll help you. But, yes, effectively we're talking in that case about what happens when we centralize. I think if you start splitting and everyone's competing for real-time workers, it would work, just not as well, for exactly the reason that is your intuition. Okay. So across these two systems, I hopefully have convinced you so far we can create these crowd-powered interfaces, these interactive systems that are supporting the kinds of tasks we cannot support with traditional systems, blurring the line between user and system. Now, there's a third dimension effectively of crowd. So that we can create these interactive systems that embed crowd intelligence, and in order to do that we can start to look at computationally motivated techniques to help the crowds accomplish these tasks. Now, at the beginning I promised that I would push past paid crowds, and I'm going to do that now. There are actually many different kinds of crowds out there on the Web. We can pay crowds, we can create new kinds of crowds. We can mine the activities that crowds have already gone and taken upon themselves. And I just want to give you a brief tour through that broad space. Because I like to play across all of them, with several citations to work I did here actually. I'll start with designing new kinds of social computing systems. In particular if you wanted to create a crowd that never existed before. This is work that tends to appear at HCI conferences like CHI and UIST as well as social computing conferences like ICWSM. Our goal is to create new social systems that never existed and understand how to design those systems. 
Now, I'll give one example. This is work I did with Eric and Desney and Greg and several others on friend sourcing. The notion here is that we may actually want information that a generic crowd would never know. So in particular if I wanted to know what to get Desney for his birthday, Mechanical Turk would have no idea. Yahoo! Answers has no idea, but people in this room, his social network, really do. By creating incentives over the social network we could encourage people (I would say Mary was also involved in this work, this is what happens when you're on the spot) to share these tags. Many people in this room, in fact, some of [inaudible] users. We got tens of thousands of tags on thousands of individuals. In follow-up, we created a system called FeedMe that starts to effectively learn models of people's interests by riding on this activity of people sharing interesting news with each other. When we do this, we can create systems like this. We can route questions like, does IUI research tend to appear at the UIST conference? These individuals are tagged with both kinds of -- both IUI and UIST, but they never had to sit there and tag themselves with their interests. We were taking advantage of the fact there's a power law here, that we can take a small number of individuals who are really active on these social networks and spread out their interests and activities such that it's to the benefit of everyone else in the social network. We can also ask what happens when we take unusual designs in the space of social computing systems. Particularly, you may have heard of 4chan, or /b/. They created the Anonymous hacker collective, which you may have heard of. It's sort of a heterodox community, to say the least. It's an unusual space in the Internet. I don't recommend checking it during the talk. Now, they make some really interesting decisions. One is that by default all posts are anonymous. Two is that they don't keep archives. 
It's not Googleable. In fact when new content comes in it pushes off older content. We got five and a half million posts from this site and simulated the dynamics of the site to ask what happens in a large-scale online community when you have anonymity and ephemerality as core design tenets. We saw that the median thread lasted just five seconds in the attentional sphere of most people; that is, on the first page. And it was pushed off completely from the site within five minutes. In fact, we also found that over 90 percent of the posts were made completely anonymously. We found some interesting ways in which -- there was a suggestion that these exact decisions of anonymity and ephemerality are what were leading to 4chan's ability to drive Internet culture. If you've seen a LOLcat, if you've ever been rickrolled, you've experienced the output of 4chan. So we can think about these kinds of questions of how to design these online communities as well. We can also talk about mining what crowds have already done. This is again work that tends to appear at HCI conferences like CHI and UIST. I'll focus here on some work I did over the summer with Sue and Jamie on tail answers. That's right, I missed another one. And Eric. >>: [inaudible] [laughter]. >> Michael Bernstein: All right. So answers, which you may be familiar with, or one box, is something like this when you query for weather. In addition to the organic search results you see something like this, which is a result that's been designed specifically, perhaps kind of sad in the case of Boston, for that kind of query. But we don't have any kind of response for something that is a much less common kind of query, like what are the substitutes for molasses, which we know to be actually collectively quite common; that is, they're in the tail. There's a large number of somewhat popular queries. 
So what we created was something called tail answers, where we can again augment these organic search results with a direct response telling you exactly what you would replace molasses with. In fact, we can create hundreds or thousands of these through an automated process, answering questions like how long do dissolvable stitches last, the story of the invention of the light bulb, how to turn up the volume on Windows XP, and many others that are all sort of collectively somewhat popular. And, I'm sorry, individually somewhat popular and collectively quite popular. So we really do turn to crowd data to make this happen. We can look for search trails, like Ryan has been exploring, where we identify where people start searching, navigating through the Web, and find pages where they have unusually high probability of getting to that page and then ending their search session. If we combine that with looking for the canaries in the coal mine, a small number of searchers who use question words in their queries, like what is the average body temperature of a dog, we can start to identify Web pages where people are finding concise informational answers to their needs. We can then use something that looks a lot like find, fix, verify, to extract that content from the Web and promote it into a direct response. So really there's a broad space here of crowd-powered interfaces and crowd-powered systems. We can talk about how we might pay people, how we can create new kinds of crowds to collect information that's never been collected before, how to look to what crowds have already been doing. So my goal really in the large scale is to integrate social and crowd intelligence directly as a core part of interaction, of software and of computation, more generally. Now, focusing just in the paid crowd space, there's a lot of ways to get there. We want to think about how we integrate crowds with machine learning, for one. 
That is, we can already now start to deploy these systems, collect the data and train better machine learning systems. But we can also then take these machine learning systems and use them to make the crowds more effective. For example, in tail answers we found that by using an open information extraction -open information extraction system, we can actually just have the crowds vet the answers, which ends up being much faster and cost less. We can also think about what happens when you say start treating these workers as like stump learners in an ensemble. We want to think about the platform. How would we change O Desk, Mechanical Turk, many of these systems, Top Coder? We've seen what happens when there's a small scale like hundreds of individuals online at once. What happens when everyone is a contractor effectively that when we have hundreds of thousands of people participating, how do we help them develop expertise and notice when they have the expertise? How do we help with lifelong learning. What do benchmarks and complexity look like in this space? If you come up with a better algorithm, how do we actually compare and understand the ways in which it is better. And at a high level we can start to talk about ways we can combine machine and social intelligence to take on these really complex or high level tasks. Think of sort of the big questions helping you write a lecture, write a symphony. Big questions. Now, this work opens up many cans of worms. Now, we don't have enough time to really get into an in depth discussion here. But I want to give you a sense of the kinds of questions that I think about and that I think are important in this space. First is that we have this sort of returning notion of scientific management. How do we think about contract ethics in this space? How do we make sure that people in expectation make a living wage when they're doing piecework. What happens when your software has goals and dreams? 
That is, there are individuals participating as part of this system and you want to support their interactions socially. You want to give them the opportunities for career advancement. These are all important parts, and I would argue they will lead to a better result if we think from that platform side. Finally, we've complicated notions of attribution. Should we have had thousands of authors on the Soylent paper? On the flip side, if there's an error, who is now at fault? So these are just a few of the issues that I think we need to push on. So in the meantime, people have picked up find, fix, verify to start doing things like image segmentation, like you can see in the upper right there. Authoring maps. They've modeled it using formal crowd languages. It's been integrated into course work at several universities. And, more broadly, again, while Soylent was one of the first crowd-powered systems I'd like to point out that there are lots of these systems that are really gaining traction in the research and practice space, things helping with blind individuals, translation, databases. It's a big space, and I hope you'll come play with me in it. So I hope I've convinced you in this hour that we can in fact create these crowd-powered systems that are going to enable experiences that you wouldn't be able to accomplish with just machine intelligence, but nor could crowds on their own perhaps do them either, within a symbiotic system that actually plays to both of their strengths. And, more generally, I hope I've convinced you that computation can become a critical component of what's known as the wisdom of crowds. So I'm part of a small crowd of collaborators. My closest mentors are Rob Miller and David Karger at MIT, along with a variety of researchers across many institutions, including here, graduate students, undergraduates, and many others. So thanks to all of them. And at this point I'd be happy to turn it around to discussion and questions. Thanks. [applause] Yes? 
>>: One question about just your observation of the workers out there. The Mechanical Turk workers. What is the total population of workers? Is it evolving over time, moving east, moving west, states, spreading out? >> Michael Bernstein: I think this will continue to be a question. The most recent information I've seen has it sort of moving east, I guess, would be a good characterization. So there are more workers in India than there used to be. I think the model you want to keep in your head here is that people in the U.S. are using it to supplement their income. In India, Bill Thies is actually doing some great work at MSR India looking at ways in which people are actually replacing their income entirely, and ways in which you can, say, use cell phone platforms, give people cell phones as a way to actually start expanding this. >>: Do you have any notion as to what the total size is? >> Michael Bernstein: That's a great question. Amazon doesn't say. My estimate would be tens of thousands are signed up, but perhaps hundreds, maybe low thousands, at any given time. That might even be an overestimate. I think that these platforms have a large space to grow. >>: Still small. >> Michael Bernstein: I said five million tasks a year. That's actually -- >>: I was surprised how small -- >> Michael Bernstein: So if you think about who is actually using these largely, it's researchers and companies like CrowdFlower that are using it to, like, verify business listings and so on. Part of what I do in my role here is expressing the much broader space that crowds could really tackle. And by doing so I hope that will push open the boundaries of what can happen. And then you look to things like O Desk where there's real expertise. Like I've hired music engineers to help me create a song for CHI Madness last year. Law Tech people, mathematicians, they exist on these platforms. 
You'll start to see a continuum from Mechanical Turk which is homogenous and sort of generic intelligence out to things like O Desk where there's real expertise. I promise to come back to you if I wasn't clear. Are you happy? >>: So the question at the beginning was why verify rather than rank. So verify would seem to return a Boolean. And rank would assume that it takes an input of multiple things and can get the best [inaudible] doing one thing and just rank it. So I guess I was saying why be specific if you could be more general? >> Michael Bernstein: You absolutely could. Eric spent time thinking sorting crowd sourcing, another way to do this, that is effectively ranking. Now you have sort of noisy comparators effectively. What we're getting is a histogram for each of those pieces of text, we're getting a number of votes that it's bad. So in a sense we can get a noisy rank from that, and it's just a matter of what you do with it. But you're right, you can push out more generally and consider. Verify is a notion of semantic, that means we're trying to effectively get a notion of what's good and bad, but you can imagine rank being a better term for that if you wanted. Yeah? >>: So one thing people often ask about crowd sourcing, what domains can you apply it to? I think you gave a lot of really compelling examples but maybe you could talk a little bit about generalizing some of the things you talked about, like task decomposition to find, fix, verify, you gave a few examples. But can that -- task decomposition the key to sort of any domain? And how might that play into tasks you might try to do in advance at sort of the system level? >> Michael Bernstein: So I think about this in the following way: Right now, I mean I can view this as a limit, like theorists want to know what's the limit of this. But you can view it as a research challenge, what can we engineer in the future. Weird space there. 
One thing that crowds are poor at is anything that requires high level knowledge. That is, if you wanted to have your entire paper shortened, there's this orthogonal element where we give every paragraph have it shortened individually. But when I shorten the text it happens by saying this section feels wordy or I could just cut this paragraph entirely. If you want to make those kinds of assertions, imagine you wanted to build a crowd source personal assistant, someone to help order pizza, reserve rooms, set up meetings, they would need to have some globally consistent knowledge of who you are and you don't like anchovies, how do you build those very large scale kinds of pieces of understanding across lots of distributed tasks, seems like a very hard problem. In particular, you can think about how much time it takes someone to get up to speed on a task. In comparison to how much it takes them to actually do the work. So if it takes me forever to sort of figure out what it is I need to do and get the expertise and read the text and I sort of hit yes or no, it's not right now a good match for crowd sourcing, but we don't have a good sense of that curve and how quickly it drops off. That would be a great thing to do. Eric? >>: So interesting topics you've covered. Which particular challenge really gets you excited for the next few years in terms of what brings your attention? >> Michael Bernstein: I think start pushing out bigger. Exactly what I was suggesting. Start pushing out from sort of not toy systems but things that are taking on simple tasks to things that really start solving complex interdependent problems is I think really hard and really exciting, if we can make it happen. >>: Get a sense of technologies or challenges? 
>> Michael Bernstein: I mean, I think about whether you can start attacking that through sort of sampling-based approaches or whether you could actually create mini management structures within the crowd to start taking on those kinds of things. I have no clue whether that will work but that's sort of what's exciting about it. Certainly I also think, really at a high level, that bringing crowd data into interactive applications is a hugely underexplored space in HCI right now. And I'm perhaps preaching to the choir here. But I really do think that that's another thing that has huge legs that we can push on. Yes? >>: So you've been talking about this [inaudible] crowd intelligence into software systems, right? >> Michael Bernstein: Yes. >>: Do you have any thoughts on how you might want to ensure reliability? Like the Word spell checker, I know it's not as good as editors but I know it works. But with this crowd-sourced system, who knows, maybe the workers are not going to be as good as the [inaudible] workers. >> Michael Bernstein: Great point. Another way to put this would be if I run Crowdproof three times I'm going to get a bunch of different suggestions. I may not get the same things twice. That's a hard problem. But I think that it is something we need to start addressing. I pointed out that reliability in terms of latency was really important. But you're absolutely right that in addition we need to think about sort of reliability in terms of repeatability. Great point. I have nothing to add, but I think it's important. Yes. >>: As I think through like all the different places you have crowd sourcing systems, you've shown at least one or two ways you can apply it to, say, Office, and I have a whole bunch of ideas, but I think it totally changes the experience in PowerPoint [inaudible] what other kinds of desktop systems.
But at the same time I think one of the things software systems allow you to do is they let you use them in a way where you're trying to construct something, where you're talking about what you're going to do against your competitor. And so software systems are nice because they're kind of like private to you and they're reliable in that way. What do you think about how you have the crowd as a consultant under NDA, or how do you have privacy? How do you prevent your creative works from being ripped off and released before you release them? >> Michael Bernstein: I just published your next paper, by the way. >>: I can appreciate that. If you're in the business of making a creative work. >> Michael Bernstein: That's right. >>: You don't want to have the first chart [inaudible] by the crowd and released online. >> Michael Bernstein: What you start to see already is that companies are getting contracted crowds not from Mechanical Turk but under NDA. Several companies are doing this. So you can sort of have your own on-demand crowd that you size dynamically as you need based on the size of the enterprise. You can also think about homomorphic crowd sourcing: what would it mean to actually take it and reliably obfuscate the critical parts in a way in which the work can still be done? But also thinking -- you were sort of pointing at what happens as you go off the desktop and what's getting crowd sourced is stuff about where I am at any given point. We're just starting to see people think about this. I think Jason Hong is starting to think about it and his students, but it's going to be a fun ride for sure, at the very least. >> Desney Tan: Any more questions? Thank you. [applause]