>> Ben Zorn: Okay, so welcome everyone. It’s a great pleasure to introduce Emery Berger. Emery is an old friend of Microsoft Research; he started here as an intern, say, more than 10 years ago, that’s all I’ll say. [laughter] He is currently an associate professor at the University of Massachusetts Amherst. Emery has done great work over the past in many different areas, systems and programming languages. There are always very clever and interesting ideas. They are also very useful, so the parallel memory allocator is still the best, most widely used parallel allocator you can get, which is great. >>: [inaudible]. >> Ben Zorn: Yeah, a paid announcement. [laughter] >> Ben Zorn: But also Emery has received a lot of honors over the years: a Microsoft Research Fellowship, an NSF CAREER Award, a Lilly Teaching Fellowship, a Most Influential Paper Award recently at OOPSLA 2012 based on intern work he did here 10 years earlier, which is wonderful, a Google Research Award, and a Microsoft SEIF Award. So really, highly acclaimed, and it’s a great pleasure to have Emery here today to talk about AutoMan, another really interesting idea. >> Emery Berger: All right. Thanks Ben, and you will get your pay later, your kickback. All right, let me jump right in. Let me add up front, I mean, this is a nice intimate setting and I know that most people at Microsoft don’t feel inhibited from asking questions anyway, but let me encourage you to jump in. So computation has been a real success story. Obviously we have been able to do amazing things with digital computation, right. We have built these algorithms that allowed us to decode the human genome. We can do these incredibly high performance simulations of things like air flow over, you know, my next car. We can do scientific blood-flow simulations with very high precision and very high speed. You know, these games really are doing 3D physics live, which is incredible.
So computers are really, everyone knows in this room, computers are amazing. We can do all this great stuff, but there are things that are hard for computers to do. So here is an example of something that’s actually relatively difficult for a computer to do. So is this a giraffe? >>: [inaudible]. >> Emery Berger: Thank you. We have a great fan in the audience. So it turns out that the Java standard library doesn’t actually have this built in, which is a rare exception. So what do you do if you want to actually --? So, you have an app, you have some application, you are going on a safari and you want to be able to recognize all these animals? Okay great, so I write an app, but then I am stuck. This function is a really, really hard function to implement. So what can I do? Yes? >>: [inaudible]. >> Emery Berger: Well it depends, let’s test. How many people think that picture is a giraffe? Wow, so it is harder, it’s really hard; only 3 Microsoft engineers can actually answer this question. So in principle you can ask a bunch of people, right. I think it’s easier for people in general to recognize this than to write a computer program to do it. So you can ask these people and of course there is this thing that most of you have probably heard of called Mechanical Turk. So how many of you have heard of Amazon’s Mechanical Turk, raise your hand. Okay, so there are a couple of hands that didn’t go up so I will explain briefly. So Amazon is this company in Seattle that you all should be aware of. It delivers books; here it delivers groceries, which is hilarious. But in addition to a number of other things they do, like running the bottom third of the web, they also have this thing called Mechanical Turk. And Mechanical Turk is this marketplace for work. So what you can do is you can post things and there will be people out there who will go and do them. So you can make money yourself. So you can sign up for Mechanical Turk and it says you can work on HITs.
So a HIT is a Human Intelligence Task; it’s just some thing that may be hard for a computer to do that you farm out to people. So you can get paid, not much as we will get into, but you can get paid. And of course you can become a sort of employer. You can get results from Mechanical Turk. So how many of you have used Mechanical Turk? Oh, far fewer, so you all should use it. Not right now, but you all should use it. So you just go to mturk.com, you sign up, and you should actually take a look at what’s there. You are not going to get rich by doing any of these tasks, but it’s quite interesting. So here is an example HIT that we pulled down. So this is a real HIT, that is to say Human Intelligence Task, and a HIT generally consists of a number of parts. It’s got text, which is the actual question; it’s got a number of question choices. So here it says, “Please enter the name of the product being advertised”. You have got a picture of a couple of hipsters and it says Levi’s. So it’s not necessarily clear what’s being advertised. But it says, “Please enter the name of the product being advertised”. The other choices are: this is not an ad; the image failed to load; the product or company name is not known; and, crucially, contains adult content. So it turns out that in fact Mechanical Turk is used for a lot of this kind of filtering, because it’s relatively difficult for computers to decide whether something contains adult content or not. So in addition to the question there is all this stuff at the top. So here it says that you get 10 --. Yeah, Mike? >>: So you are saying that’s actually what, the person who created that task, that’s really what they wanted to know? >> Emery Berger: So actually I think, no. So actually if you look right here it says Classify Ad Images. So we don’t know from this who it is, but I can tell you there are a number of sites that have shopping ability, like search portals, that in fact use this for filtering. So they use it in a number of ways.
So for example it went and it pulled up this image next to some object like jeans. So it said somebody went to go shopping for jeans, it pulled up this image; is this really an image of the jeans? That’s important to know because if they put up the image and it’s useless it’s not very helpful. And if it’s the wrong image, a really, really wrong image, that’s even worse. So I think they really do want to know all of these things. So getting back to this, any task that you get has a timer associated with it. So when you go there are all these tasks that are available, they are posted, and you can go and say, “I am going to do this task”. And then you get some amount of time for which the task is yours, and that’s how much time you get to work on it. So here you get 10 minutes. For each one you decide if you want to accept it or not. So it’s completely optional. Then here, this is how much money you get. You get $0.06, which doesn’t seem like much, and it’s certainly not much if you took 10 minutes. On the other hand, this really should take you far less than that and you could burn through quite a few of them. Also, it’s important to note that a good fraction of Amazon’s Mechanical Turk workforce is in the Indian subcontinent, more than a third. So the wage differential is substantial. And one other thing is people just do this for supplementary income. So it’s better than Minesweeper, I would say. Okay, so these are all the pieces you need to know. You can see a couple of other things here. There is some crazy qualification stuff that I will get into a little bit more. And there are some US English things, so you would need to be an English speaker. So you can filter for demographics as well. So how many of you know why this is called the Mechanical Turk? All right, a few of you. It’s a really great story. So this is the original Mechanical Turk. The original Mechanical Turk was this Victorian-era chess-playing champion.
It’s a machine that defeated chess masters around the continent. It eventually came to the United States and was defeating chess masters there. And there is something wrong with this, right. This seems a little confusing, like a Victorian-era chess-playing computer; it seems a little weird. And you would open it up and there were actually gears and there was steam coming out and it looked very impressive. But it turns out that behind that panel was a person inside. So there was actually a very skilled chess player who hid inside the machine and drove this thing like a puppet. It was actually quite large. So when the chess pieces were pushed down there was this complicated system of levers and he would see what was being played, and then he actually had his own little chess board there that he would play, and then he would move the thing to make the next move. So there you go. So the joke with Amazon is that it looks like a computer, but really there are people inside. So there is actually an API that lets you access Amazon’s Mechanical Turk and of course it’s really people. So you think, “Okay, done, this is how we solve the problem”, isGiraffe? You go and post it on Mechanical Turk. I should add, for those of you who have not actually tried to post something on Mechanical Turk, it’s actually a huge pain in the ass, because what you do is you go and create the web form manually. You put this stuff up and then you say how much time you want to allot, what you want to do, fire it off, then you wait, eventually something happens, and then you get a list of sequences of responses. Then you can kind of download them into a CSV and then you have to manually approve them or not. So it’s a real pain. All right. But, great, we can do it, right. We can answer this problem. So what do you need to know? So if we are going to go ahead and we are going to pay all these people, we need to figure out how much we are going to pay them.
So is $0.06 worth it for this computation? It’s sort of like, maybe. If you pay people too little they won’t do the task; if you pay them too much then you wasted money. In addition you need to decide how much time the task should take. So we can all agree that isGiraffe shouldn’t take much time, but there could be maybe [indiscernible] example where there is some really hard to figure out image of a giraffe. Maybe there is a giraffe and it’s hidden very carefully behind leaves. So that might be something that would take longer. And then finally there is something really bad that could happen. So we handled money and time, those are complications. What’s the real big problem here? >>: Latency? >> Emery Berger: Ah, so latency is an issue for sure, but that’s not the biggest problem. >>: Correctness. >> Emery Berger: Correctness, yeah. So what if they say no? Okay, so the whole point in outsourcing this is that you are not going to check it, because if you were going to check it you could have just done it yourself. And in fact what you want to do is you want to use this in particular for problems that are huge, where there are thousands or millions of things that you want done, not just one thing. And certainly you don’t want to involve yourself in the loop, and now this person gets it wrong. So how could this happen? It could happen for a variety of reasons. The person could just be stupid. The person is like, “Well that’s not a giraffe, giraffes are, you know, those big grey things with tusks”, right, just a moron. So this is the Homer problem. Here is a more serious concern. If you take the internet and you add money you get bots, right, it’s like a formula. So basically people have written bots that just go and take tasks off Mechanical Turk and answer them. And the idea is that they are going to make money. Now of course what people have not done is said, “Oh you have a task on figuring out adult content.
I will write an AI that can figure out adult content and then I will solve all these things”. That would be awesome actually, right. But that’s not what’s happening. What’s really happening? >>: Random. >> Emery Berger: Random, right. So what they are doing is they are effectively flipping coins. So they have a bunch of tasks and they pick something at random. So that’s clearly a concern. Now there is one other way that this could go south. So I mentioned morons, the Homer problem, and the bot problem. What else could happen? >>: Ambiguity. >> Emery Berger: Ah, so ambiguity is an issue, that’s a good point, and let’s table that for now, but you are right, that is an issue. So let’s say there is a more serious potential concern. >>: Nobody takes a HIT. >> Emery Berger: That is a concern, but luckily it turns out that if you pay people enough, people will do almost anything, so yay for capitalism. So here is the problem. We call this the evil genius problem. The evil genius problem is sort of like a Byzantine fault tolerance situation. The evil genius knows what the answer is, but always picks the wrong answer. So this is somebody who is just out to screw you for some reason. So this is the kind of thing that’s typically the way we construct the world in academia. We say, “Well, let’s pick the worst possible situation” and then we do this Byzantine fault tolerance thing. I am going to argue that this doesn’t really happen, for a variety of good reasons, and I am going to explain why. What we are going to use is something not as strong as Byzantine fault tolerance, which actually makes it impossible for you to make almost any progress on this problem. We adopt what we call a random adversary model. So the adversary we care about is the one who is clicking on things randomly. That is to say that we care about protecting ourselves against Homer and Bender; this is being recorded so I am expecting a cease and desist order anytime soon.
But we don’t actually care about protecting ourselves against Dr. Evil. So why can I say that we can do this? Here are the reasons why. These are features of the Amazon marketplace that make this a reasonable assumption. So first, I didn’t show you in much detail any of these qualifications, but almost every task comes with these qualification requirements attached. And a lot of the qualification requirements say things like, “Your total approved HITs are greater than 1,000". That is, you have done 1,000 tasks and they have all been approved; you got them right. Or there is some long-term HIT approval rate; it says that your approval rate is greater than 95 percent. That means 95 percent of the tasks you agreed to do were approved. And then there are other things like location is in the US and so on, but this sort of thing is pretty typical. That is, there is a long-term financial incentive to stay in the system and do things well. And it keeps track of your overall percentage, and once your percentage drops below a certain threshold, game over. You won’t be able to get any work anymore. So you might think, “Well this is –“. Yeah, [indiscernible], go. >>: [inaudible]. >> Emery Berger: Say it again please. >>: The approval is done by humans on the other side? >> Emery Berger: Yeah, so there are a number of ways approval can be done. You, as the person who is outsourcing stuff, are responsible for approving things. So you can go through and approve things and there are a number of ways you can do it; not many of them scale. Yes? >>: So I mean isn’t that the problem you said originally, that you have to look at the answer yourself; so how is that actually done? >> Emery Berger: So the standard way that people do it right now is they do sampling and they do what’s called gold seeding. So gold seeding is where you have a known answer and then you send out the question to everyone anyway and you verify that they actually answered it correctly.
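The gold-seeding idea just described can be sketched in a few lines. This is only an illustration, not anyone's production pipeline; the function name, the question IDs, and the giraffe/lion labels are all hypothetical. You plant questions whose answers you already know among the real ones and grade each worker only on those:

```python
def estimate_accuracy(worker_answers, gold_answers):
    """Grade a worker on the "gold" questions only: the planted
    questions whose correct answers we already know."""
    checked = [q for q in gold_answers if q in worker_answers]
    if not checked:
        return None  # the worker saw no gold questions
    correct = sum(worker_answers[q] == gold_answers[q] for q in checked)
    return correct / len(checked)

# Hypothetical batch: questions q0..q9, of which q2, q5, q9 are gold.
gold = {"q2": "giraffe", "q5": "lion", "q9": "giraffe"}
bot = {f"q{i}": "giraffe" for i in range(10)}  # a bot that always says "giraffe"
print(estimate_accuracy(bot, gold))  # gets q2 and q9 right, q5 wrong: 2/3
```

A bot that answers blindly looks fine on the non-gold questions (nobody checks them) but is exposed by the gold subset, which is exactly why the approach scales: only the planted fraction ever needs a known answer.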
>>: And so with all the other answers would you just say you approve them automatically? >> Emery Berger: Yeah, so in fact the way it works in Mechanical Turk is, I believe, there is a 48-hour timeout; if you do nothing the task gets approved, but you can also manually approve them. And the API allows you to programmatically approve things. So a pretty typical strategy is to actually require people to pass some sort of a test first before they can sign on. Then they get a qualification, which is some giant MD5 hash which gets added to their record, and then they can do your tasks. Yeah? >>: Do you still pay if you reject an answer? >> Emery Berger: It’s up to you; AutoMan does not, but that’s a policy question. >>: But I mean what’s to stop people from rejecting everything? >> Emery Berger: Yeah, that’s a great question. So it turns out, this is a really interesting situation. So Mechanical Turk doesn’t really police that side. So Amazon doesn’t go and police that people are not being evil, exploitative employers. So in a sense what they have is qualifications for employees, but not for employers. So it’s kind of an asymmetric situation. So what happens with these situations in the real world? In the real world there is maybe a coarse-grained sense that this is a good company to work for and this is a bad company to work for. There is this kind of reputation that happens, but there is also this thing called unions. And the unions get together and fight for people. So there is a de facto union called Turker Nation, and there is a website, and they keep track of reputation and complaints about different employers. They even have a plug-in for your browser. So when you hover the mouse over the employer it comes up with the reputation score, which is pretty cool. Okay, Ben? >>: So if you reject it, right, after awhile someone will start complaining fairly vocally as well. >> Emery Berger: Yeah, so there are a number of things you can do.
So it’s complicated because the price that you pay for a task that’s completed is fixed and must be announced up front. And what you would like to be able to do is actually adjust the price. Maybe not pay somebody zero and say, “Well, you put in some effort, but you didn’t get it right”, or just something to shut them up, to be honest. And that’s a bit challenging, but I will say --. Well, let’s talk about this a little bit later because we actually get to this. So this whole discussion is about why you don’t really have to worry about evildoers on the system sitting there, lurking and waiting to answer your tasks incorrectly. The other reason is, so you might think of this sort of reputation score: how do we game reputation scores? >>: [inaudible]. >> Emery Berger: Exactly, so this is known as a Sybil attack. You create multiple entities, like I will create a bunch of tasks for me and, by golly, I did them all right. My approval score goes way up and I do tons of tasks. So it turns out that, conveniently, and this is actually really crucial, the way that Amazon actually grants accounts is they are tied to your financial credentials. And in this post-9/11 world it turns out it’s actually quite difficult to create new financial credentials. The financial credentials are tied to your identity. You can’t get financial credentials without an identity. They know who you are and there is only one of you allowed to have a particular account. So you could open 10 bank accounts, but they are all you and they know that it’s you. So this makes it very, very difficult to initiate a Sybil attack. >>: I don’t understand. Amazon might know that there are 10 IDs that are you, but Turker Nation doesn’t. >> Emery Berger: No, that’s true, but you can’t --. So Amazon is the one that tracks your score. So the idea is that we are going to use a Sybil attack to create an account, like I am the evil genius. How will I be able to get work?
>>: Oh, you are the worker, not the employer. >> Emery Berger: Yeah, that’s right, that’s right. So I want to be able to somehow get in past all these qualifications without actually having to have been a good employee. So really what we are left with is you actually have to be a good employee and then kind of go rogue at some point. So you have to have done 1,000 tasks well and then suddenly be like, “Now I am evil; the mask is off”. And then your qualification rate will quickly drop and then game over. All right. So fine, yes? >>: This seems to be a similar problem to what I saw. No matter what the bad guys do they have to --. The [indiscernible]. >> Emery Berger: So I am not really familiar with this particular thing. Perhaps you can give me some more details offline. All right, cool. So now we are in the situation, as I said, where there are no evil geniuses, but we still have Homer and we have Bender. Homer is the moron; Bender is the bot who is acting randomly. So now we are still like, “What do we do?” We ask one person; that’s maybe not enough. So what else could you do? >>: Ask two. >> Emery Berger: Ask two, right. So this is a pretty generic and obvious thing to do, right, ask two people. Ask three, oh okay. So how many people should we ask? Do I hear four, do I hear five? All right, so you won. Yeah? >>: So how do the random ones get higher than just 95 percent approval ratings? That seems --. >> Emery Berger: Yeah, so that’s an interesting question. The fact that approvals happen in a kind of sampling-based way means that it is possible for you to get through. It’s an excellent question. It’s basically based on the inability of the employer to screen all of the answers, because at scale you just can’t do it, so they use sampling. >>: So if you sample less than 5 percent your approval ratings automatically go higher. >> Emery Berger: Yeah, that’s right, certainly. So one thing that people do is sort of --. >>: Another thing that will happen is less than 95 percent. >> Emery Berger: No, that’s not really what happens. But that being said, there are definitely random actors in the system. But here is the problem: once you get caught then you are done. So you are still you, you are still Manuel [indiscernible]. You can’t be like, “I can never work again, because –“. >>: But I don’t understand how I get caught. Even if I answer some of the answers wrong, if the controls are not checked my approval ratings are still above 95 percent. >> Emery Berger: So it depends on the frequency of checks, you are right. >>: Clearly you could put more scrutiny on the ones that you found in the sample that were wrong. So it’s like a two-tier thing. >>: Right, but unless Amazon helps with that --. >> Emery Berger: Yeah, yeah, that’s right. The issue of Amazon helping is one that many people have broached. Like, wouldn’t it be nice if you allowed us to do a little bit more? And so they very grudgingly incorporated new features into the system. So one thing they recently rolled out is they have a sort of selected crowd that has been more vetted, and you have to pay more to use that crowd, and then nominally there is a quality guarantee. Now, I should add the problem here is not just --. So there is this aggregate problem. There is this whole worker pool and there are people acting randomly. And maybe I will get lucky and I will get good ones, but for my computation I actually want the correct answer. I really do want to know if it’s a giraffe. I don’t want to be like, “Um, maybe it’s a giraffe” and hopefully there will be good people who answer. That’s where we are right now. So if you adopt this kind of random adversary model and you look at this redundancy you see that this is not enough; this doesn’t work. So let me give you an example: Suppose these guys are flipping coins. What are the odds that they will agree randomly? It’s 50 percent.
So you could have flipped the coin yourself. If you have an isGiraffe you could have flipped the coin yourself. You didn’t have to hire two people. So you effectively wasted two people’s worth of employment and you got nothing. So that sucks; that’s not very useful. But it turns out, if you think about it, this is a terrible probability of agreement because we flipped two coins. But if we flip six coins the chances of them agreeing at random suddenly are very, very low. So if you view this through the prism of “our adversary is random”, what can we do to minimize the chance that their agreement is due to just random probability? Then we can adopt this kind of approach. So in general if you have unanimous agreement it’s obvious what the formula is. It’s K times (1/K) to the N, where K is the number of choices and N is the number of people. So the worst case is where there are only two choices, and if it’s unanimous then you have a smaller number. It turns out that this generalizes; you can have some number of agreeing answers, even less than a plurality, and still rule out the likelihood that the agreement happened due to random chance. I am not going to show you the formulas here. This is actually unanimity and the number of choices. And this is just to show you that when you have a larger number of choices, like six choices, if you ask two people the chance of them agreeing at random is quite low. >>: That’s only if it’s random, right? What if people are just choosing the first choice? >> Emery Berger: Yeah, so that’s good. We will get there in two slides, excellent observation. All right. So we built this system that’s based on this principle of assuming that we have random adversaries and using kind of consensus to drive the correctness, to actually determine quality. It’s written on top of Scala, which is actually both good and bad. It’s bad because Scala syntax is sort of arbitrary. You can do all these amazing things with it. It’s good because you can do all these amazing things with it.
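The unanimous-agreement arithmetic above is easy to check directly. A tiny sketch (the function name is mine): with K options, there are K possible consensus values and each is hit by all N random responders with probability (1/K)^N.

```python
def p_unanimous_random(k, n):
    """Chance that n independent uniform-random choosers over k
    options all agree: k possible consensus values, each reached
    with probability (1/k)**n, i.e. k * (1/k)**n."""
    return k * (1.0 / k) ** n

print(p_unanimous_random(2, 2))  # two coin flips agree: 0.5
print(p_unanimous_random(2, 6))  # six coin flips agree: 0.03125, below 0.05
print(p_unanimous_random(6, 2))  # six options, two people: about 0.167
```

Note how the two observations from the talk fall out: two binary answers agreeing is worthless (probability one half), while six binary answers agreeing is already below the 5 percent threshold.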
It’s really a substrate for building domain specific languages. One of the best things about it is that it interacts with Java, which seems weird coming from me to say, “Like, the best thing is it works with Java”. But it means there are tons and tons of code out there that you might want to use for an app, for example. And yet there is this human piece that you would like to use. So what we did is we built this DSL that allows you to write functions, just a function, and it actually is implemented by people. And then that function you can use anywhere in your program like it was an ordinary digital function. So here are the weird things about using AutoMan. So first you have to pay people. So now you have to think about the total amount of dollars you use for your computation, which you normally don’t think about. The way Amazon works is it’s sort of like old school video games where you put in quarters and you get credit. So you can kind of refill your computation up to a certain point. In fact if it runs out it throws an exception. It’s an out-of-money exception. [laughter] >> Emery Berger: Yeah, so this is kind of a different model. That said, maybe Jim has experienced this on supercomputers. So if you have ever submitted jobs to a supercomputer system it’s exactly like this. You have an account associated with it and some amount of money is taken away. And I assume that app hosting platforms like Azure and things like that have a similar sort of model. All right, so there’s that. The other is that your functions, whenever you ask people, people are going to be imperfect. You are never going to get 100 percent assurance that this is correct. So now we have a confidence level associated with functions. So the default is 95 percent, the gold standard for social sciences. But anyway, you can set this to be whatever you want.
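AutoMan itself is a Scala DSL, but the programming model just described can be mocked up in a few lines of Python to show its shape. Everything here is illustrative, not AutoMan's real API: the names, the budget accounting, and the simulated "crowd" are all stand-ins (real workers are replaced by a random choice).

```python
import random

class OutOfMoneyError(Exception):
    """Raised when the budget for human computation runs out."""

def make_human_function(question, options, budget, price=0.06):
    """Illustrative mock of a human-powered function: the real system
    posts HITs on Mechanical Turk; here the crowd is simulated."""
    state = {"budget": budget}
    def ask(item):
        if state["budget"] < price:
            raise OutOfMoneyError(question)
        state["budget"] -= price
        return random.choice(options)  # stand-in for real workers
    return ask

is_giraffe = make_human_function("Is this a giraffe?", ["yes", "no"], budget=0.13)
print(is_giraffe("photo1.jpg"))  # a simulated answer: "yes" or "no"
print(is_giraffe("photo2.jpg"))
try:
    is_giraffe("photo3.jpg")     # the third call exceeds the $0.13 budget
except OutOfMoneyError:
    print("out of money")
```

The point of the design is the call site: `is_giraffe(...)` looks like any other function, and running out of quarters surfaces as an ordinary exception rather than a silent failure.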
What this means is the probability that the result that you obtained is due to random chance is less than .05. And you can dial this up or down. So you can kind of trade off quality and money directly. Okay. So I mentioned there are these things that AutoMan has to do --. >>: Is that trade-off understandable? >> Emery Berger: It’s understandable in one sense, but I think that it’s understandable actually in two dimensions. So I don’t know for sure how much money I will end up having to pay, but if I get a smaller number of people and they reach consensus quickly then I can say, “Well, I have reached a lower threshold of confidence”. So the way that AutoMan actually works --. >>: But I can’t predict what it would cost for me to raise it to a 99 percent confidence. >> Emery Berger: You can predict what it would be at minimum, but not at maximum. That said, I didn’t mention this, but there is a confidence level overall, but you can also have a confidence level for the individual invocation of a function. So you can say, “All right, there are a million animal classification tasks and I normally want to pay as little as possible, but I definitely want to pay no more than a dollar for any of them”. So you can do that. So there is the overall computation and then there is per function as well. Manuel? >>: Is there memoization of this function? >> Emery Berger: Yes, yeah. All the results go into, I think it is, a MySQL database, so it’s stored. >>: [inaudible]. >> Emery Berger: Yes, that’s right; they are pure functions, exactly. >>: So if I understand this, if you have this overall confidence that means that some invocations of the function could have very low confidence and others high confidence, is that right, or is this 95 percent per result? >> Emery Berger: Yeah, okay, so let’s see. So if I have a function and I establish a confidence level, the confidence level is actually over every invocation of the function.
That’s the way AutoMan works right now. So I have got a function isGiraffe and for every single result I have a certain P value associated with it. And then there is this money thing that lets you control how much you will spend on any invocation. >>: Okay, I see. >> Emery Berger: It can be that there are sort of semi-ambiguous situations where there is not very quick agreement, but there is a bias. And that means that AutoMan will have to ask more people to sort of figure out with high probability what the right answer is, and you can cap that. But yeah, the way AutoMan works right now is you cannot say you want 95 percent confidence over 1,000 tasks and then the individual ones, some of them, could be zero. We want to do that, but it’s not implemented yet. >>: So what does it mean, the 95 percent? It means that if they answer at random I would accept the result with this probability? >> Emery Berger: You would accept the result with less than 5 percent chance. So it’s one minus that. >>: So what if this looks like a giraffe, but is not a giraffe? I mean if there are some confusing, like maybe it looks like a lion, but it’s a giraffe and there are these two --. >> Emery Berger: That’s an awesome lion, but okay. [laughter] >>: So there are these two people who give me this consistent lion answer. >> Emery Berger: Ah, yeah, so this is a good question. So the question that you are asking really is: Is what people say really true? All right. So we adopt the Stephen Colbert version of reality, which we call “truthiness”. So if everybody agrees it is true, then it’s true. There is really nothing else we can do. So my example of this is, you know the expensive car that has this logo, it looks like this, comes from Germany, the brand is Mercedes Benz; who is the manufacturer? So I kind of tricked you here already, or I led in, that this is a trick question. The manufacturer is not Mercedes Benz, because Mercedes Benz is not a company.
But if you ask people at random they will almost certainly tell you that it’s manufactured by Mercedes Benz, but it’s actually Daimler. >>: Right, but if like a third of the world would say that it’s a lion and two-thirds would say it’s a giraffe, what’s the confidence level it’s a giraffe? >> Emery Berger: So if you assume that the sampling is uniform and random then you will --. >>: It’s easy. >> Emery Berger: Yeah, then it’s easy, that’s right. >>: So that’s how it’s done. >> Emery Berger: Yeah, yeah. Okay. All right. So I mentioned, in addition to correctness, which I have been focusing on, there are these other issues. These issues are actually fairly easily dispensed with, somewhat surprisingly. So first, what we do, it’s not yet happened, but it’s almost certain to happen that when you contract people out through Amazon you are going to be considered an employer. And at some point minimum wage laws are going to kick in. So just to stay on the right side of the law, the default, which you can change, is that AutoMan pays US minimum wage. All right. So that’s how much we pay; the question is still how much time to allot to tasks and do we pay more? Manuel? >>: I mean the task shows 10 minutes, $0.06; that can add up. >> Emery Berger: No, it does, but that wasn’t mine. We didn’t post that, okay. So what we would do, this is the way it works in AutoMan: every task starts out at 30 seconds and time is money. So we pay a wage, and the wage right now is minimum wage. So it would be $0.06 for a 30-second task, which is the currently desultory US minimum wage. This is the actual Federal minimum wage: 7.25 divided by 120, since there are 120 thirty-second slots in an hour. >>: It’s $10.00 in Washington. >> Emery Berger: Yeah, states can have it be higher, but the Federal is the minimum. >>: [inaudible]. >> Emery Berger: You know, that’s --. It’s like, you know how it’s your responsibility to pay state taxes for stuff you buy online, right. >>: [inaudible]. [laughter] >>: [indiscernible].
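The "then it's easy" computation from the exchange above can be sketched with a union bound: under the random-adversary model, the chance that some option collects at least m of n random votes bounds the p-value of an observed plurality. This is my conservative sketch of that idea, not necessarily the exact test AutoMan uses.

```python
from math import comb

def p_agreement_by_chance(k, n, m):
    """Union-bound on the probability that, among n uniform-random
    answers over k options, some option gets at least m votes."""
    p = 1.0 / k
    tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m, n + 1))
    return min(1.0, k * tail)

# The unanimous case matches the K*(1/K)^N formula: 2 * (1/2)^6 = 0.03125.
print(p_agreement_by_chance(2, 6, 6))
# A 4-vs-2 split over two options out of 6 answers is not rare at random,
# so a 4-of-6 "giraffe" plurality would not clear a 0.05 threshold yet.
print(p_agreement_by_chance(2, 6, 4))
```

When the bound is above the threshold, the scheme's answer is to keep asking more people until the observed plurality becomes too lopsided to be chance.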
>> Emery Berger: Well, so there is --. Well, only because of the state tax, is that what you are referring to? I mean obviously you can pay more. You can pay more or you can pay less; this is just a parameter to the system. So you can actually say wage equals something. >>: I think more seriously you can sort of create pressure, right. So you can basically do [inaudible], right. You can have 20-second tasks, 25-second tasks, etc. >> Emery Berger: That’s right, so what happens actually, so this is in fact --. Oh, nice, that’s weird, stupid Mac. Yeah, blame the Microsoft people who work on Mac products. All right. So anyway here is the way a task starts off and what could happen? So nobody could show up to do the task or not enough people show up to do the task. What does that mean? It could mean two things. It could mean, one, the task was given too short a time allotment. Like people tried to do the task, but it took longer than 30 seconds. The other is people looked at the amount of pay and said, “Screw this”, right. So there is no real way for AutoMan to know, so what it does is it actually doubles both. That is to say it doubles the amount of time and the pay, so the wage rate stays the same. So I argue that this actually prevents gaming even if you know that AutoMan is using this strategy, which may not be obvious, so I will walk through an intuitive example. >>: I am not sure what [inaudible] behind that was. Much of the time they don’t bite because the task description is boring. >> Emery Berger: Oh, yeah. >>: And then what happens is that you double the time and eventually it looks like it is way too lengthy and boring. >> Emery Berger: So let me tell you, there are some weird phenomena. So first I believe that if you pay people enough they will do almost anything. So I think that with a task that’s really, really boring, if you offer them a buck to do something that’s really simple and is going to take them 10 seconds they will probably do it.
So as the pay goes up people definitely are more attracted. The other thing is that this has a weird side effect on Amazon’s Mechanical Turk, because Mechanical Turk just has a strange user interface where freshly posted tasks are up at the top. So if you re-post a task at a higher rate not only is it worth more, it also appears at the top. So it has a kind of nice --. That’s not something that we rely on, but it’s a happy circumstance. All right, so here’s the gaming situation. So suppose you are like, “Hey, AutoMan doubles things, I am going to become rich at Emery’s expense”. So how am I going to do this? So I am going to see money coming in, for some reason denominated in euros. So I see this money that comes and I think, “Well it’s 5 euros, but I will just wait” and the task goes away, gets re-posted and now it’s 10 euros. So it’s just a waiting game, right. I just need to wait until it becomes larger, and larger, and larger and at some point it becomes a giant stack of money. And then I jump in and grab the task. So what’s the problem with this strategy? It’s people, stupid other people. There is always this problem that if the marketplace doesn’t have just one person waiting there might be somebody else, and that person has some probability of seeing that 20 euro bill and taking it away, making the dreams of the giant stack of money vanish. So what really happens is that it’s complicated; it’s not just that I need to wait a certain amount of time. I need to take into account the likelihood that somebody is going to snipe the task from me. They are going to take it away and do it. So this is just a little, very, very simple bit of math. What is your expected gain? So your expected gain is the base rate of pay, and then every round it doubles. So the multiplier is two, and in round one, round two and round three it gets two times, four times, eight times larger, but there is this probability that the task is still available after all these rounds.
So we model it as if it’s a fixed value. You could be more complicated, but this is sufficient. So let’s say that the P available is a half. I have 50/50 odds that every time a task is posted somebody else will do it before I get to do it. And then the multiplier is two and through the power of math those cancel. So your expected gain is fixed, which means there is actually no incentive to ever wait. >>: Well, but you took that out of your half, right? >> Emery Berger: Well, half turns out to be the maximum likelihood estimator for this. So actually that’s what you would do if you have no information. >>: It’s probably even lower. >> Emery Berger: Yeah, so if the likelihood that it’s available is one there is nobody else in the system, right. The likelihood that it’s available in fact is likely to drop as the value of the task goes up, which means that this would even be a negative incentive. Okay. So good, we have got the money and the time thing taken care of. It goes, it asks people and it obtains confidence. If it doesn’t get the confidence on the first round it goes and asks more people and so on until it achieves the desired confidence and then returns a value. So everybody should be on board with this. Let me just observe that this random adversary thing has some weird effects. One of them is that you no longer even need a majority. A majority is neither necessary nor sufficient to achieve your desired confidence, right. So, if I have a bunch of coins, let’s say they are dice, so they are six-sided, the chance of getting random agreement of a large enough subset actually gets to be pretty small pretty fast. This is the number of people involved, this is the number of choices, so here is a die and as you add more and more people the number that you need to reach 95 percent confidence, which is this axis here, gets lower and lower and eventually drops below the 50 percent line. So it’s just an interesting --.
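The cancellation argument above can be written out. With a doubling multiplier and a 50 percent chance that the task survives each round unclaimed, the expected value of waiting is flat. This is a Python sketch in my own notation of the talk’s arithmetic, not AutoMan code:

```python
# Expected gain from waiting `rounds` re-posts of a task that starts at
# `base` pay, multiplies by `multiplier` each round, and survives each
# round unclaimed (nobody snipes it) with probability p_avail.
def expected_gain(base: float, multiplier: float, p_avail: float, rounds: int) -> float:
    return base * (multiplier * p_avail) ** rounds

# With multiplier = 2 and p_avail = 1/2 the factors cancel, so waiting
# never improves the expected payoff; with p_avail < 1/2 it strictly hurts.
```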
You can’t think of this really usefully in terms of how many people, majority and all of this stuff; really the random adversary model is crucial. Okay. So this goes to [indiscernible] question. So here is an example kind of task that you can do. So this is what we call a RadioButton task. You can choose one of K answers as the actual correct answer. So how many giraffes are in this picture; none, more than one and one? Okay, so the K is just three here. So what’s wrong with this approach? So put yourself, if you will, in Homer and Bender’s shoes. What would you do if you were a bot? Which one would you click, maybe the first one? So there is a problem here that Homer is lazy. So random is actually good from our perspective. What’s bad is suppose everybody clicks the first one. Then they will all agree and we are like, “Oh, crap”, because from AutoMan’s perspective it’s like, “Yay, instant agreement, the answer is none, no giraffes”. So that’s not good. So we solve this trivially by randomizing all the answers. So anybody who sees any posting sees a differently shuffled version of the answers. So if they behave randomly, or if they behave in a fixed way I should say, with a bias that’s not actually biased towards the correct answer, then they will look like noise, which is good for us. Todd? >>: So did you see this in practice? I mean this clearly is a nice feature, but you must have seen this in practice if you actually added it to the system. >> Emery Berger: Well, we actually built the system in a very principled way to start out with and then we deployed the system. Then the question really is did we ever observe people who were acting randomly, and I think we did. >>: Did you see it any other way though? In the sense that people click the first button more often? >> Emery Berger: Yeah, so people --.
I don’t remember if we actually went through and looked at that, but we definitely got people answering things in a way that made us highly suspicious that they were doing nothing more than just clicking something. >>: Well clicking something is not going to [inaudible]. >> Emery Berger: So if everybody clicks something randomly then we will probably never, unless we ask a ton of people, get to any high level of confidence. But again, if everybody in the marketplace is just acting randomly then the marketplace has no value. >>: So does Amazon know whether there are bots and do they do anything about it? >> Emery Berger: So certainly they are aware, because people have pointed it out to them. I believe that the bots that are used are mostly for tasks that are highly repetitive and difficult to verify. So that is for example --. >>: So the [inaudible] choose who will deploy the robots. >> Emery Berger: Yeah, that’s right. So what they do is they look for a task where there are like thousands, and thousands and thousands of them posted. So the odds of them being checked are low, so scrutiny is unlikely. Then ones that don’t have fill in the blank or any other sort of structure, so that they can deploy the thing randomly. >>: What do you mean they see thousands of tasks? Aren’t all the tasks individual? >> Emery Berger: No, so what they do, I think I showed this before, but maybe not. So it says 6 cents per hit and then there are some number of hits that are available. So what workers really like to do, and real human workers really like to do this, they like to sort of learn how to do a task and then do that task over and over again. They love that, I mean it’s something that everybody has noted and it really is true. They kind of get trained and they amortize the training over doing many of them. >>: So randomizing the order doesn’t seem to help in a case like that.
If all of them have “none” as one of the options, it would be two lines of code to have it always pick the same answer. >> Emery Berger: Yeah, right. So the question is, if I wrote a bot that actually looked at the text in the answers and then picked based on the answers, it’s certainly doable, certainly if they knew there was some sort of structure here. A: nobody does it, luckily. B: there is no good general way to solve this. The only way to solve it --. >>: Multiple-choice CAPTCHAs. >> Emery Berger: Yeah, so, I am not sure about multiple-choice CAPTCHAs, but you can buy people, you can buy bulk-rate CAPTCHA solvers at ridiculously cheap rates. So there are people in Vietnam who will solve 10,000 CAPTCHAs for you for like 10 dollars. So they are just like, bing, bing, bing, bing, typing things in; it’s crazy. >>: So I am curious, suppose that my task here is Where’s Waldo? So a variant of Where’s Waldo where there are multiple Waldos and maybe there are 10 in the actual picture and they are progressively harder to find. So when you ask them how many Waldos are in this picture someone is very easily going to say one and maybe there are 10. It’s going to take them a lot of effort to see 10. >> Emery Berger: So again, I think this comes down to truthiness, right. >>: Correct, and I totally understand that, but my question is this though: What’s your programming model API? Suppose I actually care, not whether there are less than 10, but whether there is a 2 percent chance that there are greater than 7, because then that means something to me. Maybe I want to give a million dollars to somebody. So it’s a different flip then on the multinomial that you have been talking about, right. >> Emery Berger: It is, yeah, that’s intriguing. What we have been assuming is that you ask anybody. So you pick a pool of people who are either equally qualified or who are self-selecting and answer questions. You don’t expect on Mechanical Turk to be able to find Scala programmers, or not many.
But you expect to find people who can recognize giraffes, and the people who don’t know what giraffes are aren’t going to accept the hit. Their skills, we assume, are more or less uniform. So if you say, “Well here’s a question”, what’s going to happen is that if everybody --. If the question has this huge drop off, like you can find the first one super easily, but then you have to be unbelievably anal to find the rest, then what’s going to happen is there is going to be this huge bias in the distribution of answers and you are going to end up hitting that. If however the curve looks a little different, it’s a little smoother of a decline, then you are going to end up with a bias towards one thing, but it’s going to keep asking more people. And it all depends on where the majority lies. But it really is about consensus right now; it’s not looking for an outlier. Now we have subsequent work that we are working on right now which is actually about obtaining distributions. So, where there is no one correct answer necessarily; there may be a distribution we care about and that’s something we are working on right now. All right. So let me continue just talking about more we can do and I will actually show some code in a minute, but the API is very simple. So this is a question, in fact a question to which I believe we had a bogus responder. We actually ran this; this seems like a totally goofy thing to ask, like which ones are from Sesame Street. What we actually did was we said, “Which of these do or do not belong?” And it was a way of doing kind of clustering with people being the metric. But this is what we call a check box question. So here every one of them could be on or off. So instead of it being five choices it’s actually two to the five choices. So the chance of them agreeing randomly is very, very low, except that if you present it like this people will think, “Great” and they will just agree.
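The arithmetic behind “two to the five choices” is worth making explicit: with k independent boxes there are 2^k possible answers, so two workers answering uniformly at random agree only with probability 1/2^k. A sketch in my own notation, not AutoMan code:

```python
# Number of possible answers to a checkbox question with k boxes, and the
# chance that two workers answering uniformly at random give the same one.
def outcomes(k: int) -> int:
    return 2 ** k

def p_random_agreement(k: int) -> float:
    return 1.0 / outcomes(k)

# Five characters on screen: 32 possible subsets, ~3% random agreement.
```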
Or if you present it like this then great, they will just agree, so we randomly permute them. We turn off and on all these check boxes at random and so if people just say, “Yeah, that’s it” then it just looks like noise. >>: Isn’t that weird to have the answers filled in? >> Emery Berger: Is it weird? Yeah, there is some weirdness. One of the other weird things is, we haven’t implemented this, but we also have a way of doing drop downs. And with the drop down, if you think about it, what’s the easy, lazy thing to do? Click the first thing, right, so that’s a problem. So what we do, this is not deployed but we have done it: you could think, “Oh, well I will just randomly shuffle them”, but nobody would want to use that. Like here are all 50 states or here are the 200 countries, find your country. Nobody wants to do that, right. So what we do instead is we just imagine the thing as a ring and we randomly pick a starting point and then it wraps around. So that avoids that bias. >>: So does Mechanical Turk have native support for this? >> Emery Berger: It does not; it has no native support for anything, practically. >>: So have you thought about proposing it to them? >> Emery Berger: So I have reached out to them. I actually should probably try to do that while I am here, to be honest. >>: I mean yeah, you are pretty close and realistically it seems like --. >> Emery Berger: I know somebody who works there and I know that the Mechanical Turk group is very small, but that Jeff Bezos really likes them. Like he loves Mechanical Turk, this is one of his pet projects. >>: [inaudible]. [laughter] >> Emery Berger: Right, it might be an improvement, yes. >>: But there is ample evidence about the importance of being the first choice on the list; ask any politician. They fight to get that spot at the top of the left column.
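The “ring” idea for drop-downs described above can be sketched directly: keep the familiar ordering but rotate it to a random starting point so no entry is always first. Illustrative code, not the (undeployed) implementation:

```python
import random

# Keep the familiar (e.g. alphabetical) order but start the drop-down at
# a random position on the "ring", wrapping around, so no single entry
# always appears at the top.
def ring_rotated(options, rng):
    start = rng.randrange(len(options))
    return options[start:] + options[:start]
```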
>> Emery Berger: Sure, yeah, no, absolutely, that’s certainly true and unfortunately the randomization is one time for all the ballots that everybody sees. >>: [inaudible]. >> Emery Berger: Really, because it’s electronic voting. >>: Do you know how Schwarzenegger was elected? >> Emery Berger: Because he is like big and strong and talks like this? >>: They randomized the alphabet and rotated it. >> Emery Berger: Yeah, and so you think that’s how he got elected? >>: No, but. >> Emery Berger: I and Arnold would disagree. You should be more worried about the latter I guess. Anyway, so what else can we do? So we have this RadioButton. We have check box questions and we also have constrained free text. So here this is, “What does this license plate say?” And you have a way of actually answering a question like this. You can’t just type in any arbitrary thing. Here it’s sort of a bounded-length regex, but it turns out people are not that good at regex in practice so we make it simpler. We use something that’s actually adapted from FORTRAN, from FORTRAN, no, why did I say FORTRAN? It’s from COBOL, from COBOL. So these are picture clauses from COBOL, so I got to cite COBOL, that was awesome. So the number of letters is the total number of characters and each letter has some meaning. So X means alphanumeric, and a 9 means numeric only, etc. So there are a few choices like this. Okay. So clearly the range of possibilities is now very high; it’s staggeringly high. It’s thirty-six to the sixth if they were choosing at random. So really you just need two people to agree and you are done. >>: Sorry, what do you present to the user in this case? >> Emery Berger: They get a form, they fill it in, and it has got JavaScript in it and so on. >>: No, I mean in terms of --. >> Emery Berger: Oh, so there’s the programmer and then there’s the worker.
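A minimal checker for these picture clauses might look like the following. This is my sketch, covering only the X and 9 codes mentioned here; the Y code for optional characters that comes up later with the license plates is omitted:

```python
# Validate an answer against a COBOL-style picture clause:
# X = any alphanumeric character, 9 = digit only.
# The clause length fixes the answer length, which is what makes
# free-text answers directly comparable.
def matches_picture(pattern: str, answer: str) -> bool:
    if len(pattern) != len(answer):
        return False
    for p, c in zip(pattern, answer):
        if p == 'X' and not c.isalnum():
            return False
        if p == '9' and not c.isdigit():
            return False
        if p not in ('X', '9'):
            return False
    return True
```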
The worker sees a blank and it says how many characters it is, and then it’s constrained just like when you fill out a normal form anywhere. >>: So what’s to stop them from typing A, A, A, A, A? >> Emery Berger: Oh, nothing, that’s right. >>: So that’s the easy answer. So you can fill it in randomly as well. >> Emery Berger: So we have talked --. So there are a bunch of UI issues here. One of the things that we talked about doing, you can imagine it being like one of those combination locks where you turn each individual number. So you could totally do that, and you could basically have some kind of a drop down with the alphabet, but it would be horrible. Nobody would want to do it. People just don’t want to do it. >>: I think your answer here is a single letter, right. >> Emery Berger: It’s a single letter. So everybody has to agree with the single letter. So I agree --. >>: A. >> Emery Berger: A, everybody says A; A is the answer. So the problem here basically is the weakness of the random assumption when it comes to certain kinds of human behavior. So the question is, is everybody really going to type A, A, A, A, A? It’s probably more biased than typing A, Q, Z, 1, 4, 3, right, but we don’t know what that bias is. >>: It could be any random starting letter. >> Emery Berger: It could be any random starting letter. And I will say that with this particular task and because of the long-term financial incentives this is the kind of thing that nobody writes a bot to do. This is exactly the kind of task where they are like, “No way, the bot has to interact with JavaScript, no”. They just look for very simple forms and they fill in things at random. Yeah? >>: If you want to find out what people type randomly, ask people [inaudible]. They figure out what a German cryptographer doesn’t. >> Emery Berger: What they actually type, yeah. So we have talked about doing stuff like actually --.
I mean it’s not research, but it would be easy; you could just measure the bias of how people type things. This is not a problem, and in fact there is subsequent work that has nothing to do with this which is about detecting errors in data. We have this project that we call data debugging, and of course people make typos in very particular ways, but we kind of don’t care. We can find them without knowing that sort of bias. So I am going to show you a run. So this is a real execution that actually fortuitously exposed all kinds of things that AutoMan does to maintain quality, for pricing and all this stuff. So this is the task where we wanted to automatically cluster children’s characters, children’s show characters. So we took a whole bunch of things, we said, “Which ones don’t belong?” And then you could get like these are in one cluster and these are in another cluster. This is hard from a computer vision perspective because it depends on some other relationship. It’s not about the way the thing looks; it’s related to the program that they were all on. I mean there might be some relationship, but. Okay, so here’s one that popped up. Which one doesn’t belong? >>: SpongeBob. >> Emery Berger: SpongeBob, thank you, somebody watches TV. Okay, everybody else is like, “Oh, I don’t watch TV; I am too educated and cultured”. But I have kids and I definitely know what these are. All right. So what happens? So in this particular case AutoMan has this strategy that by default is trying to optimize for money. So it tries to spend as little as possible and it kind of crosses its fingers and hopes for unanimity on the first round. Okay. So it turns out you need three tasks to get unanimity here, to get the 95 percent threshold. So it goes and spawns them and within two minutes somebody said, “SpongeBob”. And what’s cool about AutoMan is that it exposes a socket that you can listen to and you can actually see all the activity.
So when you write a program it’s waiting for the function to return; you map this function onto a bunch of images and it goes. Then you open up the socket and you can watch this. It’s like worker twelve has arrived, worker twelve says the response is this, and worker twelve’s task was approved and paid this much; it’s pretty funny. So then we waited and just a few seconds later we get another SpongeBob and then somebody said, “Kermit” and we were like, “Really, Kermit, you have got to be kidding me”. And so one of my friends who is Greek said, “Well”, because Greeks are really good at this. He said, “Well, Kermit is the only one who has been on two different shows”. And I was like, “Okay, you win, but I am pretty sure that’s not the line of reasoning that went into this. I believe this person just clicked”. Okay, he was like click, done. So AutoMan is like, “Well, that’s not enough”. So AutoMan then goes and spawns three more tasks. The reason it has to spawn three is that now it needs five out of six to get to this 95 percent confidence level. So somebody comes in a few minutes later with SpongeBob, but it’s not enough because we have spawned all these other tasks. We need the other two to come back. We get this weird delay; there are some time-of-day issues with Mechanical Turk. Like when it’s dark over the American continent and India, or sort of the beginning of the day and the end of the day, there are not many people in Samoa working on Mechanical Turk, unfortunately. So it timed out and if it times out what happens? So AutoMan doesn’t know if it timed out because people are sleeping, we could put that in but we don’t, or if people just were like, “This is not enough money”. So it goes and it doubles the amount of money and time, and then somebody rather quickly showed up with SpongeBob and that’s it, we are done. So it returned SpongeBob for this one and it cost 36 cents and it took a while.
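The “three for unanimity, then five out of six” numbers come from asking how likely purely random responders are to produce the observed agreement. Here is a Monte Carlo sketch of that check in Python; the real AutoMan computes this probability analytically, so this only illustrates the idea:

```python
import random

# Estimate the probability that, among n workers each picking one of k
# options uniformly at random, some option collects at least t votes.
# An answer with t agreeing votes out of n is accepted only when this
# probability falls below 1 - confidence (below 0.05 for 95 percent).
def p_random_consensus(k, n, t, trials, rng):
    hits = 0
    for _ in range(trials):
        votes = [0] * k
        for _ in range(n):
            votes[rng.randrange(k)] += 1
        if max(votes) >= t:
            hits += 1
    return hits / trials
```

For instance, with five options on screen, three unanimous random answers land on the same option with probability 1/25 = 0.04, already under the 5 percent bar; one dissenting click forces more votes before the bar is cleared again.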
But I think it took a while for weird reasons, but that’s something else we will talk about in a minute. Yes? >>: So I think this is unsatisfying on a number of levels. One of them is that you have latencies that are very high for tasks that take, I guess, about three seconds. >> Emery Berger: Yes, yes. >>: Another one is that there is a degree of unpredictability or [indiscernible], as well as the cost you end up paying, as well as the amount of soul searching you have to go through if you are actually looking at that. It’s a bit too much I would say. >> Emery Berger: So let me respond to that in two ways. So one, the 95 --. Actually do I have this room to 12 or is it actually ending? >> Ben Zorn: Yeah, until 12. >> Emery Berger: Because [indiscernible] is running away. All right, so there are two things here. The first: 95 percent is a pretty high bar. And it may be too high a bar to impose for every single question. So this idea of weakening it, so it’s 95 percent or 90 percent, whatever you want, for a collection rather than each individual circumstance, could allow you to drop stragglers. So right now we are stuck; every one of them is a barrier in a sense and we don’t want that. But what we could do is basically adopt the sort of thing MapReduce does: these things took a long time to come back, we cut them off. And I believe we can reduce latency in that way, but there is definitely a latency/money trade-off, which we will talk about in a minute. So we have a way of dealing with that. All right. So let me move on and talk about another example. So we wrote this license plate reader application and you might think, “All right, this is a solved problem”. Things are better than they used to be. People have been working on this problem for three decades. There is an unbelievable amount of literature on what’s called ANPR, Automatic Number Plate Recognition.
A few years back Maryland actually deployed one of these systems because they wanted to do a traffic flow analysis. So they had this very cute idea: put cameras on the highway on-ramps and off-ramps, see which license plate came on and see where it went. Great idea, right, because this is a huge problem for civil engineers; they don’t actually know where people go. So they send out surveys and they ask people, “Hey, where do you go?” And nobody fills out the surveys, or the surveys are unreliable, and they say, “Please keep a diary” and it doesn’t really work. So they were like, “We will solve this problem”. So they got some company that promised them something like 98 percent accuracy for reading these license plates, but this is what they got. So they ran their own vehicles through the system and in only twelve percent of the cases were the license plates correctly identified, which is pretty low. So what happened? So you might think this is just a solved problem. So first I should observe that this is people going at a pretty good clip; they are going fast. And these are cameras that are mounted up high, so it’s not like if you have driven on the east coast where they have these E-ZPass or Fast Lane things where the camera is here, there is a big light and you have to go slow, all of these sorts of situations. There are many other factors; it turns out this is actually a much harder problem than you think. So you all can probably see that plate. This is a non-foggy day in San Francisco, well illuminated, and it works great. This is a screen shot from the I-90 traffic cam in Boston. So this license plate is a little hard to read. So there is this problem of weather, visibility, and there are all these other issues. So Massachusetts compared to many states has relatively few plates, but it has quite a number with a lot of noise. And this is the California situation; every state has tons.
There are tons and tons, different fonts, different visibilities, so it’s actually fairly complicated. So given that this is such a well-studied problem and computer scientists are involved, you will be unsurprised to learn that there is a benchmark suite. So these are actually drawn from a benchmark suite of license plates. These are the most difficult license plates for automatic systems to read. >>: [inaudible]. >> Emery Berger: Yeah, but it’s not bad. I mean some of them, if you look at this you are like, “That’s pretty straightforward, but this is a little weird. Is it an M or is it a W?” You can see some of the problems, but that’s a CAPTCHA. So that’s a screen shot of a CAPTCHA. What the hell does this say? So it’s actually maybe even easier than a CAPTCHA, but it’s getting close. It’s around the same complexity. So it’s hard for computers to solve this problem. All right. So my student Dan Barowy, who was here last year as an intern, is the lead student on AutoMan and this is a task that he posted, and it even says his name, which I told him was a very bad idea, but anyway. So it popped up and he got angry e-mails every now and then. So this goes back to somebody’s question about people complaining about not getting paid. So the amount of money that comes up here, see, there’s 6 cents per hit and it’s 30 seconds. There is your clue that it’s AutoMan; that’s pretty much a guarantee. And there were some number of hits available, it put these things up and you had to go ahead and type in the answer. So when he first did this we actually weren’t aware of the benchmark suite. So we had to populate it with images of cars. And I said, “Dan, go find images of cars”. And Dan found this website called OldParkedCars.com, which is so the internet, and you are like, all right. So there is a website that’s nothing but photographs, you can see, of cars, like this is an interesting car I found parked somewhere.
So those images, he actually combined them with Google Image Search to get more images of cars. Then they had to be scaled to the right size and he de-speckled them and did whatever. So this is all interaction with Java stuff; because Scala is on the JVM it interacts directly with that stuff. So it was really easy. We posted the images on Amazon S3 and then there is a whole ton of them and it’s basically two maps. So he mapped a function, and one of the functions is: Is this a car? You might think that’s weird, but if you search for cars and you take the images from old parked cars some of them aren’t cars. Some of them are like, “My name is Joe”, like it’s a selfie in front of a car. And then he takes pictures of cars. There is a lot of noise. And then the next one, once it whittles down the list, it then maps a function which is, “What does the plate say?” and it returns strings. So what does this code look like? It’s pretty simple; this is actually more than what you need. It added some stuff which is the default, so here is the function “Is_Car”. So Is_Car takes a URL as an input and it’s a RadioButton question, which means there is just some fixed number of choices. Here we have a budget for this question. We will never pay more than a buck for this question. You don’t need to do this. It’s not a bad idea. Here’s the confidence, which is a default. By default it’s 95 percent. Here’s the question: Is this a car? Here is an image that pops up and then this is just Scala to represent the string and the actual sort of [indiscernible] value, if you will, that comes back. So these are the return values. You either get a tick yes or a tick no. All right. That’s actually the whole function for Is_Car, and then this is the license plate function. It’s “What does this plate say?”, here is the URL. It’s a “free text” question and it has this pattern down here. And the Y means they are optional alphanumeric characters, so more COBOL picture clauses.
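The two-map pipeline described above can be sketched as follows. This is a hedged Python sketch, not the actual Scala program: is_car and plate_text stand in for the AutoMan-backed crowd questions and are hypothetical functions here, so only the shape of the pipeline is shown, filtering out non-cars first and then transcribing only real plates:

```python
# Hypothetical sketch of the two-stage pipeline: filter the image URLs
# down to actual cars, then map the plate-reading question over what
# remains, pairing each URL with its transcribed plate.
def read_plates(urls, is_car, plate_text):
    return [(u, plate_text(u)) for u in urls if is_car(u)]
```

Filtering first also matters for cost: as noted later in the talk, most of the per-plate expense came from the cheap “is this a car?” stage rather than the transcription.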
So we ran this program and the first thing it does is say, “Is there a vehicle?” It spawned these tasks and a number of workers came back. They said yes, they said yes, we got unanimous agreement, good, and then it goes and asks, “What does the license plate say?” So it’s a little hard to read the results up here, but this one is actually a Washington plate by random chance, 767JKF. >>: How about JFK. >> Emery Berger: JFK is one of the responses. >>: [inaudible]. >> Emery Berger: Yeah, it’s a typo, or maybe just a bias, a weird bias, like somehow JFK, but this was enough: two people disagreed. >>: I actually got out of a ticket one time because of that. My license plate was FYL and the meter maid typed FLY. >> Emery Berger: Ah, nice, that’s good. There’s a lesson there for us all. We should all get personalized license plates. [laughter] >>: As close to a real word. [laughter] >> Emery Berger: Exactly, that’s good. So they disagreed, it spawned more tasks, they initially timed out so it spawned them again for 12 cents, and then it got this thing back. So there is actually an interesting feature that I wasn’t aware of until Dan pointed it out to me: if nobody has taken your hits you can withdraw them. So it actually cancels these outstanding tasks. So when somebody says they accept them they have got them, but if nobody has actually said, “We accept them” then it will cancel them. So I was like, “Hmm, I have a really good idea of something we can do”. So I will get to that in a second. So here are some results. We ran this program and this is the suite of these hard images. It turns out they are all from Greek trucks, so Greek trucks are hard to read. So it’s called the LPR database. We used the extremely difficult data set. The accuracy received was --. So it was a 95 percent confidence level and we got 91.6 percent accuracy. Now how did we measure ground truth? Shockingly this benchmark suite does not include the ground truth.
[laughter] >> Emery Berger: No kidding, it's crazy. I guess nobody had any hope of answering these anyway, so nobody has ever tried. So Dan and Charlie, who has also been an intern here, sat down with the images, magnified them, and then had arguments over which ones were which. So they went through all 144 plates and some were really straight out of a bad Sci-Fi movie, like zoom, zoom, enhance, enhance: is it a W or is it an M; is it a V or is it a U? Some of them are really hard and people make mistakes on them. This is a lot better than 12 percent, and it took us 15 minutes to write this app. I was like, "Dan, go write a license plate thing" and it was done. Okay. It costs 12 cents, more or less, per plate. The latency was actually pretty low: less than two minutes per image. It really does depend on what kind of task you post and how many HITs there are; there are a lot of things that are difficult to account for. All right, but there are some things that we exploit, and I am going to explain one of the things that we exploit to reduce latency. So here are a couple of things we are doing. One of the things I have just described is that we have just three types of questions: there is RadioButton, there is CheckBox, and there is this restricted free text. We would actually like to be able to handle arbitrary questions like, "What's the answer?" with free-form text. There are problems here; part of the issue is you can get answers where there is no way anybody is ever really going to agree. So here is a canonical problem: translate this paragraph into another language. The odds that people will agree are either vanishingly low, so you will never get consensus, or super, super high because they all use Google Translate. Both of these are problems, so what we are trying to do is turn this into a problem where we actually throw the answers back to the Mechanical Turk workers and have them rank them.
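The canonicalize-then-rank idea for free-form answers can be sketched roughly as follows; this is a hypothetical illustration, not AutoMan's implementation, and the normalization rule (lowercasing and collapsing whitespace) is just one example of how you might decide two answers are "the same":

```python
def canonicalize(answer):
    # Hypothetical normalization: collapse whitespace and case so that
    # trivially identical free-text answers count as one answer.
    return " ".join(answer.lower().split())

def distinct_candidates(answers):
    # Keep one representative per canonical form; these distinct
    # candidates are what would be sent back to workers for ranking.
    seen = {}
    for a in answers:
        seen.setdefault(canonicalize(a), a)
    return list(seen.values())

# Three raw answers, two of which are really the same answer:
cands = distinct_candidates(["Hello  world", "hello world",
                             "Bonjour monde"])
```

If only one distinct candidate survives (everyone pasted the same machine translation), there is nothing to rank, which is exactly the "no confidence, like asking one person" problem mentioned below.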
And we will adopt an approach that basically gives you confidence in the resulting rankings. Then there is this question of optimizing time and money. So right now --. Yeah? >>: How does it work if they all run to the same service? >> Emery Berger: Oh yeah, so all of them will be canonical; if everybody only uses Google Translate then there is only one answer and we will be screwed. So hopefully there is some reasonable person in the audience who is not using Google Translate. But if only one response comes back then there is no ranking and you would have no confidence. It's sort of like asking one person. So yeah, you canonicalize them all, take all the ones that are different, and ask people, "Which one is better?" >>: Going back to the license plates, could Maryland afford to pay 12 cents per scanned license plate? >> Emery Berger: Yeah, that's a good question. So it turns out there is a commercial system that some police forces use. They have a camera that's mounted on the roof of their car. They just drive around and it reads license plates. What it does is take pictures of the license plates that are in front of it and send them off to a giant server, just some cluster, which does some sort of vision processing and returns results. It costs 100,000 dollars a year to subscribe to that service per deployment. So it's non-trivial. And by the way, this is 12 cents a plate, and most of that cost actually came from ruling out whether they were cars or not. If you know it's a car you don't have to ask anybody. So if you say, "Well, I have got a picture of a car in front of me and it's not a donkey or a dude standing in the road", then you can get consensus even cheaper. Okay. So in AutoMan time is money, but as all of you may be aware, other people's time is not worth as much as maybe your time. So a worker's time, we have said, is 7.25 an hour. How many people here get paid more than 7.25 an hour?
Oh, I am never working at Microsoft again; one person. So, most people make more money. So let's assume that your time is worth 200 an hour. This is my Mitt Romney slide. So really, do you really care about being super, super judicious with the expenditure of money? Maybe you care more about latency. So here is this mechanism: you can say, "Look, 3 is the bare minimum", but what AutoMan lets you do is actually take the ratio between your time and the worker's time value. And it will spawn tasks so that, in a sense, the worst-case situation is that you do it yourself. So as long as it is cheaper than doing it yourself it's a win, and you may want it to be faster. So you can actually set some time-value multiplier and what it does is spawn way more tasks than it needs, but because Amazon allows you to cancel outstanding tasks, once you reach your threshold you can cancel them. So the question is how much do you want to pay? This is just a dial; it's a risk, so you can bound your risk. The minimum you pay is the default, but if you want to get faster answers you take the risk of, in the worst case, spending that multiplier more. In the best case you will spend exactly the same amount, but there is a bigger pool, there are more HITs available, more people will sign up to do the task, and you will get the answer faster. And you can think of it this way too: the way that it works right now is you have this barrier. It says, "I create the bare minimum. If everybody agrees we are good. If one person disagrees it's round two." >>: I don't understand your comment that you will have a "bigger pool", because it seems to me that the latency is in the discovery of the task, not the actual completion of the task. >> Emery Berger: Ah, so, maybe. >>: But what are the minutes, because you had minutes for those things and those tasks take seconds.
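The risk bound behind the time-value multiplier can be sketched with made-up numbers; this is an illustration of the arithmetic just described, not AutoMan's actual scheduler, and the reward per task here is invented:

```python
# Sketch of the multiplier's cost bound. With multiplier 1 you post only
# the bare minimum of tasks; with a larger multiplier you post more in
# parallel and cancel the unaccepted extras once you have consensus.
# Best case: the extras are cancelled before anyone accepts them, so you
# pay the minimum. Worst case: every posted task gets accepted, so you
# pay the multiplier times more.

def cost_bounds(min_tasks, reward_per_task, multiplier):
    tasks_posted = min_tasks * multiplier
    best_case = min_tasks * reward_per_task      # extras cancelled in time
    worst_case = tasks_posted * reward_per_task  # every task accepted
    return tasks_posted, best_case, worst_case

# Hypothetical numbers: 3-worker minimum, 6 cents per task, multiplier 4.
posted, best, worst = cost_bounds(min_tasks=3,
                                  reward_per_task=0.06,
                                  multiplier=4)
```

Note that the multiplier never changes what an individual worker is paid; it only changes how many tasks are posted in parallel, which is the point taken up in the discussion below.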
>> Emery Berger: Yeah, the tasks actually take some number of seconds, that's true. >>: So unless you sort of had a lot of people --. >> Emery Berger: So think about it this way; there are two things you need to think about. Suppose that I just post three tasks; I say three because three is the bare minimum. There is a risk that I won't get all three back correct; they won't all agree. So the way that it works is those three will get posted, then it will come back to the system and say, "I waited for all three and now I am going to go and do more". I was super optimistic with respect to the likelihood of consensus. So what I can do is dial down my optimism by extending the number of people I am willing to ask. In a sense what I am doing is taking multi-phase barriers, barrier one, barrier two, who knows how many barriers, and doing more work up front in parallel, somewhat speculatively. So it's a kind of speculative execution in a sense. I launch these tasks and whichever ones come back faster, I just kill the other threads. So it's a way of reducing latency. >>: Yeah, I was going to say it's really related to the cost of the time it takes to kill them, right, because the only risk is that in the time between when your task is complete and the time that you kill all the rest, people enter the system and complete them. >> Emery Berger: Yeah, right. >>: And all the tasks can get accepted and not completed. >> Emery Berger: They could; that's your risk. >>: I think the real risk is the amount of divergence, right. You have a race to kill these things and then more of them jump on these things. Suppose you have 5,000 [indiscernible] and thousands of people can do this thing. Then there will not be any consensus just because of the sheer number of these people. And then [inaudible]. >> Emery Berger: Oh, well that's not actually how it works, so I don't really understand the confusion. >>: Why won't you have divergence?
>>: Yeah, it seems to me you have a problem. If you have three answers that agree then you would say it's correct. But if a fourth task was accepted and that guy is going to disagree with you, now you need five out of six, which means you are going to have to go back. >> Emery Berger: Sure, sure, sure, sure. >>: So it seems like, when do you stop? >> Emery Berger: Yeah, but you are making an assumption. You are making a conditional probability assumption that changes the odds. You are saying that if this already happened then there would be this greater risk, but you need to look at it differently, right. The question is: if I spawn three tasks, what is the likelihood that those three tasks will all agree? Or if I spawn four tasks, what's the likelihood that three of them will agree? >>: But I don't think that's important. I think what's important is the possibility of this kind of divergence, right. >> Emery Berger: Where people, so the divergence --. >>: You run the risk of spending way too much money just because of this very attractive task. >> Emery Berger: So this business of the multiplier, maybe this is part of the confusion: this doesn't affect the rate of pay for each individual worker. The rate of pay stays the same; they still get, say, 7.25 per hour. I am not paying them more; I am just giving more people the opportunity to do the task. >>: And the first three that --. >> Emery Berger: And the first three that finish --. >>: So if they came in and are good, we are great. >> Emery Berger: Yeah, that's right. >>: But I think that's a bias toward bots and Homers, right, because they click fast, whereas the guys who actually looked at the license plate [inaudible]? >> Emery Berger: Yeah, so that's an interesting question. So the bot thing I still think --. So you are saying it's biased because they were fast. So you can do things, and people have done these things, we don't right now, where they measure what is known as paradata, yes, paradata.
Paradata is where you actually track response time, you track mouse movements, and you track if they click or hover over things, all of these sorts of things. It turns out that people in general don't move the mouse directly in a perfect path straight to the answer. But of course you can imagine gaming this; it becomes some weird arms race. You could write a bot that appears to cogitate and is hesitant and like, "Ah, how about this one?" So then you are back in CAPTCHA land. So I think that it's a fool's errand to try to distinguish people from bots in any way other than by using this consensus-based approach. >>: [inaudible]. >>: So the [indiscernible] would say you can't actually make the assumptions you want to make if you don't add this extra bias of [indiscernible]. >> Emery Berger: Um, yeah, I am not sure. I mean, of course the bots could be answering; they could be sniping all of your tasks. So I don't really see why it changes the game to have this approach. It doesn't seem any different to me than the first approach. >>: [inaudible] that's why. >> Emery Berger: No, but if I have bots --. >>: It's just like when they do [inaudible]. >> Emery Berger: Okay, I see what you are saying, right. >>: You are increasing your population, but you are keeping arbitrary the [inaudible]. >> Emery Berger: Okay, now I see what you are saying. Right, okay, yeah, yeah, yeah. So your point basically is that if I ask a thousand people I have basically changed the set; is the pool of people I ask different than just asking three people? It is different, it is different. I think that this is an unfortunate and difficult thing to deal with, because if you think about the way statistics normally work, what you are supposed to do is set a sample size in advance and query it. And of course in this situation, if we do that, then in many cases we are going to end up not getting an answer or asking too many people.
So it's very difficult to do this adaptively in a way --. >>: There are very well studied ways to statistically --. >> Emery Berger: That's right, that's right, so there is something called the Baum, Bom, I can't even speak right now, the Bonferroni-Holm correction, that's what it is. What you do is basically tighten the p-value thresholds every single time you run another test. So we have definitely thought about this. We have not done it yet, but it is an issue. >>: [inaudible]. >> Emery Berger: That's right, yes. >>: You want to reach 95 percent accuracy; you go one by one and you stop whenever you reach it. There is some rule that [inaudible]. >> Emery Berger: There is, yes, and that's sort of how it works. It is a little more subtle, but it's kind of like that. >>: But if you do that I don't see the difference between doing a million and just three. >> Emery Berger: Sure, sure. >>: Sorry, but I don't see how that addresses my concern that if the first three agree and then the fourth one comes in [inaudible]. >>: But why would they agree? If you have four, why would the first three agree? >>: No, no, but that's [inaudible], and if you cut off there you are going to say, "I am done, I have my confidence", but if you actually considered the fourth that happens to come in then you are now below and you have to wait for two more, right? >>: But that's a specific event that may happen. And you have the same probability of working with [inaudible]. >>: No, no, but I mean [inaudible]. >>: [inaudible]. >> Emery Berger: So it comes down to this correction; basically it's a question of sequential testing. That's the whole point of doing this correction: in a sense, there is a rejection criterion that changes. And so as you ask more people you have to reject more and more of the answers to get the same level of confidence. So this is something that we should be doing, but we are not yet doing. It is going to go in.
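The two statistical pieces in this exchange can be sketched minimally; this is a simplified model, not AutoMan's exact computation. The first function brute-forces the chance that workers answering uniformly at random reach a given level of agreement (the "if I spawn four tasks, what's the likelihood that three agree?" question), and the second is the Holm step-down correction mentioned for repeated, sequential testing:

```python
from itertools import product

def prob_agreement(n, k, options=2):
    # Probability that, among n workers answering uniformly at random
    # over `options` choices, some single choice gets at least k votes.
    # Brute-force enumeration; fine for small n.
    hits = 0
    total = options ** n
    for combo in product(range(options), repeat=n):
        counts = [combo.count(o) for o in range(options)]
        if max(counts) >= k:
            hits += 1
    return hits / total

def holm_rejections(p_values, alpha=0.05):
    # Holm step-down procedure: sort the p-values ascending, compare the
    # i-th smallest (1-indexed) against alpha / (m - i + 1), and stop at
    # the first failure. Returns the indices (into the original list) of
    # the rejected hypotheses. The threshold tightens with each test,
    # which is the "reject more and more of the answers" effect above.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (m - rank + 1):
            rejected.append(idx)
        else:
            break
    return rejected

# Chance that 3 random guessers all agree on a yes/no question: 0.25.
p3 = prob_agreement(3, 3)
# Chance that at least 3 of 4 random guessers agree: higher, 0.625,
# which is why asking more people demands a stricter threshold.
p4 = prob_agreement(4, 3)
```

The random-guesser model is the pessimistic baseline: consensus is meaningful only when it is very unlikely to arise from workers who pick answers at random.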
There is something that is actually a more serious problem, which I want to solve in the same framework; I think this is going to get to your question. So suppose you have dependent tasks? I keep talking about these things like they are just one function, right: I want to know this one function. So I have got some function F; F is a human-implemented function and has a 95 percent confidence level. Let's assume that we get 95 percent confidence. Now suppose I have a function G and it also has 95 percent confidence, awesome, until I do this. What if I compose them, right? I say F of G of X? Now I don't have 95 percent confidence; now I have about 90 percent. So what do I do? I need to take into account the actual flow of the computation and the amount of uncertainty, okay. And we want to do it in a way --. So how can you solve this problem? A priori, what you could do is crank these things up really, really high so that you always have 95 percent out at the end. So you would be like 99 and 98 and then you are done, right, or 98 and 99; which one should you do, flip a coin? No, that's not really smart. So what we want to do is do this dynamically. If we discover that F is relatively cheap and relatively fast and G is relatively expensive and relatively slow, what we can do to boost confidence is actually spend more on F and less on G. And so now when we multiply .99 x .96 we get .95. So we can do this dynamically and kind of optimize on the fly to get the most bang for our buck, in a sense. So that's a coming attraction and it's going to fit in with the correction. Yeah, so good, perfect, right at the end. So in sum, we built this system for programming with people. Okay, the mystery door that keeps opening. It's fairly easy to program. Once you write these things they behave like ordinary functions in your programming language and you can do what you want with them.
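The confidence-composition arithmetic above can be sketched directly; this assumes the per-function confidences are independent, which is the same assumption the slide's multiplication makes:

```python
def composed_confidence(confidences):
    # Overall confidence of composing human-implemented functions,
    # assuming independence: the product of per-function confidences.
    result = 1.0
    for c in confidences:
        result *= c
    return result

# Two functions at 95 percent each compose to about 90 percent
# (0.95 * 0.95 = 0.9025), which misses a 95 percent target:
naive = composed_confidence([0.95, 0.95])

# Rebalancing, spending more on the cheap, fast function and less on
# the expensive, slow one, recovers it (0.99 * 0.96 = 0.9504):
rebalanced = composed_confidence([0.99, 0.96])
```

This is why cranking both functions to 99 percent would overshoot and overspend: the dynamic approach only buys as much extra confidence, from whichever function is cheapest, as the composition actually needs.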
Nobody actually asked this question, but they are of course implemented as futures. So it's not like everything blocks when you invoke a function. They are all really designed for you to run map on a bunch of stuff, and only when you need a result does it actually block; everything gets spawned in the background. It takes advantage of the amazing embarrassing parallelism of people, which is great. So we built this thing, it solves all these problems, especially the Homer and Bender problem, and you can download it. It's on a [inaudible] and it's on this web page, automan-lang.org. I know it's in Scala, and one of the next things we are going to do is put Java wrappers around it. We are thinking about Python. I am not sure we are going to handle C++, although Dan is really a big fan of F# right now, so you never know. Anyway, all right, thanks for your attention. [clapping] >>: Very good talk. >> Emery Berger: Oh, thanks.