>> Ben Zorn: Okay, so welcome everyone. It’s a great pleasure to
introduce Emery Berger. Emery is an old friend of Microsoft Research;
he started here as an intern, say, more than 10 years ago, that’s all
I’ll say.
[laughter]
He is currently an associate professor at the University of
Massachusetts Amherst. Emery has done great work over the years in many
different areas, systems and programming languages. There are always
very clever and interesting ideas. They are also very useful, so the
parallel memory allocator is still the best, most widely used sort of
parallel allocator you can get, which is great.
>>: [inaudible].
>> Ben Zorn: Yeah, a paid announcement.
[laughter]
>> Ben Zorn: But also Emery has received a lot of honors over the
years. A Microsoft Research Fellowship, NSF CAREER Award, a Lilly
Teaching Fellowship, a Most Influential Paper Award recently at OOPSLA
2012 based on intern work he did here 10 years earlier, which is
wonderful, a Google Research Award, and a Microsoft SEIF Award. So
really, highly acclaimed and it’s a great pleasure to have Emery here
today to talk about AutoMan, another really interesting idea.
>> Emery Berger: All right. Thanks Ben, and you will get your pay
later, your kickback.
All right, let me jump right in. Let me add up front I mean this is a
nice intimate setting and I know that most people at Microsoft don’t
feel inhibited from asking questions anyway, but let me encourage you
to jump in.
So computation has been a real success story. Obviously we have been
able to do amazing things with digital computation, right. We have
built these algorithms that allowed us to decode the human genome. We
can do these incredibly high performance simulations of things like air
flow over, you know, my next car. We can do scientific blood flow
simulations with very high precision and very high speed.
You know, these games really are doing 3D physics live, which is
incredible. So computers are really, everyone knows in this room,
computers are amazing. We can do all this great stuff, but there are
things that are hard for computers to do. So here is an example of
something that’s actually relatively difficult for a computer to do.
So is this a giraffe?
>>: [inaudible].
>> Emery Berger: Thank you. We have a great fan in the audience. So
it turns out that the Java standard library doesn’t actually have this
built in, which is a rare exception. So what do you do if you want to
actually --? So, you have an app and you have some application, you
are going on a safari and you want to be able to recognize all these
animals? Okay great, so I write an app, but then I am stuck. This
function is a really, really hard function to implement. So what can I
do?
Yes?
>>: [inaudible].
>> Emery Berger: Well it depends, let’s test. How many people think
that picture is a giraffe? Wow, so it is harder, it’s really hard; only
3 Microsoft engineers can actually answer this question.
So in principle you can ask a bunch of people, right. I think it’s
easier for people in general to recognize this than to write a computer
program to do it. So you can ask these people and of course there is
this thing that most of you have probably heard of called Mechanical
Turk. So how many of you have heard of Amazon’s Mechanical Turk, raise
your hand. Okay, so there are a couple of hands that didn’t go up so I
will explain briefly.
So Amazon is this company in Seattle that you all should be aware of.
It delivers books; here it delivers groceries which is hilarious. But
it also has in addition to a number of other things they do like
running the bottom third of the web they also have this thing called
Mechanical Turk. And Mechanical Turk is this marketplace for work. So
what you can do is you can post things and there will be people out
there who will go and do them. So you can make money yourself. So you
can sign up to Mechanical Turk and it says you can work on hits. So a
hit is a Human Intelligence Task, it’s just some thing that may be hard
for a computer to do that you farm out to people.
So you can get paid, not much as we will get into, but you can get
paid. And of course you can become a sort of employer. You can get
results from Mechanical Turk. So how many of you have used Mechanical
Turk? Oh, far fewer, so you all should use it. Not right now, but you
all should use it. So you just go to mturk.com, you sign up and you
should actually take a look at what’s there and you are not going to
get rich by doing any of these tasks, but it’s quite interesting.
So here is an example hit that we pulled down. So this is a real hit,
that is to say human intelligence task and a hit generally consists of
a number of parts. It’s got text which is the actual question; it’s
got a number of question choices. So here it says please enter the
name of the product being advertised. You have got a picture of a
couple of hipsters and it says Levi’s. So it’s not necessarily clear
what’s being advertised. But it says, “Please enter the name of the
product being advertised”. And the choices include: this is not an ad,
the image failed to load, the product or company name is not known,
and crucially, contains adult content.
So it turns out that in fact Mechanical Turk is used for a lot of this
kind of filtering, because it’s relatively difficult for computers to
decide whether something contains adult content or not. So in addition
to the question there is all this stuff at the top. So here it says
that you get 10 --.
Yeah, Mike?
>>: So you are saying that’s actually what, the person who created that
task, that’s really what they wanted to know?
>> Emery Berger: So actually I think, no. So actually if you look
right here it says Classify Ad Images. So we don’t know from this who
it is, but I can tell you there are a number of sites that have
shopping ability like search portals that in fact use this for
filtering.
So they use it in a number of ways. So for example it went and it
pulled up this image next to some object like jeans. So it said
somebody went to go shopping for jeans, it pulled up this image, is
this really an image of the jeans? That’s important to know because if
they put up the image and it’s useless it’s not very helpful. And if
it’s the wrong image and if it’s a really, really wrong image that’s
even worse. So I think they really do want to know all of these
things.
So getting back to this, here any task that you get has a timer
associated with it. So when you go there are all these tasks that are
available, they are posted and you can go and say, “I am going to do
this task”. And then you get some amount of time for which the task is
yours. And that’s how much time you get to work on it. So here you
get 10 minutes. For each one you decide if you want to accept it or
not. So it’s completely optional. Then here, this is how much money
you get. You get $0.06, which doesn’t seem like much and it’s
certainly not much if you took 10 minutes. On the other hand, this
really should take you far less than that and you could burn through
quite a few of them.
Also, it’s important to note that a good fraction of Amazon’s
Mechanical Turk workforce is in the Indian subcontinent, more than a
third. So the wage differential is substantial. And one other thing
is people just do this for supplementary income. So it’s better than
Minesweeper I would say.
Okay, so these are all the pieces you need to know. You can see a
couple other things here. There is some crazy qualification stuff that
I will get into a little bit more. And there are some US English
things so you would need to be an English speaker. So you can filter
out kind of for demographics as well.
So how many of you know why this is called the Mechanical Turk? All
right a few of you. It’s a really great story. So this is the
original Mechanical Turk. The original Mechanical Turk was this
Victorian era chess playing champion. It’s a machine that defeated
chess masters around the continent. It eventually came to the United
States and was defeating chess masters there. And there is something
wrong with this, right. This seems a little confusing, like Victorian
era, chess playing computer, it seems a little weird.
And you would open it up and there were actually like gears and there
was steam coming out and it looked very impressive. But, it turns out
that behind that panel was a person inside. So there was actually a
very skilled chess player who hid inside the machine and drove this
thing like a puppet. It was actually quite large and then he had –-.
So when the chess pieces were pushed down there was this complicated
system of levers and he would see what was being played and then he
actually had his own little chess board here that he would play and
then he would move the thing to make the next move. So there you go.
So the joke with Amazon is that it looks like a computer, but really
there are people inside. So there is actually an API that lets you
access Amazon’s Mechanical Turk and of course it’s really people.
So you think, “Okay, done, this is how we solve the problem”,
isGiraffe? You go and post it on Mechanical Turk. I should add that
for those of you who have not actually tried to post something on
Mechanical Turk it’s actually a huge pain in the ass, because what you
do is you go and create the web form manually. You put this stuff up
and then you say how much time you want to allot, what you want to do,
fire it off, then you wait, eventually something happens and then you
get a list of sequences of responses and then you can kind of download
them into a CSV and then you have to manually approve them or not. So
it’s a real pain.
All right. But, great we can do it right. We can answer this problem.
So what do you need to know? So if we are going to go ahead and we are
going to pay all these people we need to figure out how much we are
going to pay them. So is $0.06 worth it for this computation?
It’s sort of like, maybe $0.06. If you pay people too little they
won’t do the task; if you pay them too much then you wasted money. In
addition you need to decide how much time the task should take.
So we can all agree that isGiraffe shouldn’t take much time, but there
could be maybe [indiscernible] example there is some really hard to
figure out image of a giraffe. It may be there is a giraffe and it’s
hidden very carefully behind leaves. So that might be something that
would take longer. And then finally there is something really bad that
could happen. So we handled money and time, those are complications.
What’s the real big problem here?
>>: Latency?
>> Emery Berger: Ah, so latency is an issue for sure, but that’s not
the biggest problem.
>> Correctness.
>> Emery Berger: Correctness, yeah. So what if they say no? Okay, so
the whole point in outsourcing this is that you are not going to check
it, because if you were going to check it you could have just done it
yourself. And in fact what you want to do is you want to use this in
particular for problems that are huge where there are thousands or
millions of things that you want done, not just one thing. And
certainly you don’t want to involve yourself in the loop and now this
person gets it wrong. So how could this happen?
So it could happen for a variety of reasons. The person could just be
stupid. The person is like, “Well that’s not a giraffe, giraffes are,
you know, those big grey things with tusks”, right, just a moron. So
this is the Homer problem.
This is a more serious concern. So if you take the internet and you
add money you get bots, right, it’s like a formula. So basically
people have written bots that just go and take tasks off Mechanical Turk
and answer them. And the idea is that they are going to make money.
Now of course what people have not done is said, “Oh you have a task on
figuring out adult content. I will write an AI that can figure out
adult content and then I will solve all these things”. That would be
awesome actually, right. But that’s not what’s happening. What’s
really happening?
>>: Random.
>> Emery Berger: Random, right. So what they are doing is they are
effectively flipping coins. So they have a bunch of tasks and they
pick something at random. So that’s clearly a concern. Now there is
one other way that this could go south. So I mentioned morons, the
Homer problem and the bot problem. What else could happen?
>>: Ambiguity.
>> Emery Berger: Ah, so ambiguity is an issue, that’s a good point and
let’s table that for now, but you are right that is an issue. So let’s
say there is a more serious potential concern.
>>: Nobody takes a hit.
>> Emery Berger: That is a concern, but luckily it turns out that if
you pay people enough people will do almost anything, so yay for
capitalism.
So here is the problem. We call this the evil genius problem. So the
evil genius problem, this is sort of like a Byzantine fault tolerance
situation. The evil genius knows what the answer is, but always
picks the wrong answer. So this is somebody who is just out to screw
you for some reason. So this is the kind of thing that’s typically sort
of like this is the way we construct the world in academia. We say,
“Well let’s pick the worst possible situation” and then we do this
Byzantine fault tolerance thing. I am going to argue that this doesn’t
really happen for a variety of good reasons. So I am going to explain
why.
What we are going to use is something not as strong as Byzantine fault
tolerance, which actually makes it impossible for you to make almost
any progress on this problem. We adopt what we call a random adversary
model. So the adversary we care about is the one who is clicking on
things randomly. That is to say that we care about protecting
ourselves against Homer and Bender, this is being recorded so I am
expecting a cease and desist order anytime soon. But we don’t actually
care about protecting ourselves against Dr. Evil.
So why can I say that we can do this? So here are the reasons why.
These are features of the Amazon Marketplace that make this a
reasonable assumption. So first I didn’t show you in much detail any
of these qualifications, but almost every task comes with these
qualification requirements attached. And a lot of the qualification
requirements say things like, “Your total approved hits are greater
than 1,000". That is you have done 1,000 tasks and they have all been
approved, you got them right or there is some long-term hit approval
rate. So it says that your approval rate is greater than 95 percent.
That means 95 percent of the tasks you agreed to do were approved.
And then there are other things like, location is in the US and so on,
but this sort of thing is pretty typical. That is there is a long-term
financial incentive to stay in the system and do things well. And it
keeps track of your overall percentage and once your percentage drops
below a certain threshold game over. You won’t be able to get any work
anymore. So you might think, “Well this is –“.
Yeah, [indiscernible] go.
>>: [inaudible].
>> Emery Berger: Say it again please.
>>: The approval is done by humans on the other side?
>> Emery Berger: Yeah, so there are a number of ways approval can be
done. You as the person who is outsourcing stuff are responsible for
approving things. So you can go through and approve things and there
are a number of ways you can do it; not many of them scale.
Yes?
>>: So I mean isn’t that the problem you said originally is that you
have to look at the answer yourself; so how is that actually done?
>> Emery Berger: So the standard way that people do it right now is
they do sampling and they do what’s called gold seeding. So gold
seeding is where you have a known answer and then you send out the
question to everyone anyway and you verify that they actually answered
it correctly.
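[Editor's note: the gold-seeding idea just described can be sketched in a few lines of Python. This is an illustrative sketch, not Mechanical Turk's actual interface; the function name grade_worker and the 90 percent threshold are assumptions.]

```python
# Hypothetical sketch of gold seeding: mix questions with known answers
# ("gold") into a worker's stream, and approve the worker only if they
# answer enough of the gold questions correctly.

def grade_worker(responses, gold_answers, threshold=0.9):
    """Approve a worker whose accuracy on the seeded gold questions
    meets the threshold. `responses` maps question id -> answer."""
    graded = [responses.get(qid) == answer
              for qid, answer in gold_answers.items()]
    accuracy = sum(graded) / len(graded)
    return accuracy >= threshold

gold = {"q1": "giraffe", "q2": "lion"}   # answers we already know
good_worker = {"q1": "giraffe", "q2": "lion", "q3": "zebra"}
bad_worker = {"q1": "elephant", "q2": "lion", "q3": "zebra"}

print(grade_worker(good_worker, gold))  # True
print(grade_worker(bad_worker, gold))   # False: only 50% right on gold
```

Note that the non-gold answers (q3 here) are approved without inspection, which is exactly the scaling point made above.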
>>: And so with all the other answers would you just say you approve
them automatically?
>> Emery Berger: Yeah, so in fact the way it works in Mechanical Turk
is I believe there is a 48 hour timeout; if you do nothing, the task
gets approved, but you can also manually approve them. And the API
allows you to programmatically approve things. So a pretty typical
strategy is to actually require people to pass some sort of a test
first before they can sign on. Then they get a qualification which is
some giant MD5 hash which gets added to their record and then they can
do your tasks.
Yeah?
>>: Do you still pay if you reject an answer?
>> Emery Berger: It’s up to you; AutoMan does not, but that’s a policy
question.
>>: But I mean what’s to stop people from rejecting everything?
>> Emery Berger: Yeah, that’s a great question. So it turns out, this
is a really interesting situation. So Mechanical Turk doesn’t really
police that side. So Amazon doesn’t go and police that people are not
being evil exploited employers. So in a sense what they have is they
have qualifications for employees, but not for employers. So it’s kind
of an asymmetric situation. So what happens with these
situations in the real world? So in the real world there is sort of a
coarse grading of this is a good company to work for and this is
a bad company to work for. There is this kind of reputation that
happens, but there is also this thing called unions. And the unions
get together and fight for people.
So there is a de facto union called Turker Nation and there is a
website and they keep track of reputation and complaints about
different employers. They even have a plug-in for your browser. So
when you hover the mouse over the employer it comes up with the
reputation score, which is pretty cool.
Okay, Ben?
>>: So if you reject it right, after a while someone will start
complaining fairly vocally as well.
>> Emery Berger: Yeah, so there are number of things you can do. So
it’s complicated because the price that you pay for a task that’s
completed is fixed and must be announced up front. And what you would
like to be able to do is actually adjust the price. Not maybe pay
somebody zero and say, “Well you put in some effort, but you didn’t get
it right” or just something to shut them up to be honest. And that’s a
bit challenging, but I will say --. Well let’s talk about this a
little bit later because we actually get to this.
So this whole discussion is about why you don’t really have to worry
about evildoers on the system sitting there, lurking and waiting to
answer your tasks incorrectly. The other reason is, so you might think of
this sort of reputation score. How do we game reputation scores?
>>: [inaudible].
>> Emery Berger: Exactly, so this is known as a Sybil attack. You
create multiple entities and you just create, like I will create a
bunch of tasks for me and by golly I did them all right. My approval
score goes way up and I do tons of tasks. So it turns out that
conveniently and this is actually really crucial, the way that Amazon
actually grants accounts is they are tied to your financial
credentials.
And in this post-9/11 world it turns out it’s actually quite difficult
to create new financial credentials. The financial credentials are
tied to your identity. You can’t get financial credentials without an
identity. They know who you are and there is only one of you allowed
to have a particular account. So you could open 10 bank accounts, but
they are all you and they know that it’s you. So this makes it very,
very difficult to initiate a Sybil attack.
>>: I don’t understand. And Amazon might know that there are 10 IDs
that are you, but Turker Nation doesn’t.
>> Emery Berger: No, that’s true, but you can’t --. So Amazon is the
one that tracks your score. So the idea is that we are going to use a
Sybil attack to create an account, like I am the evil genius. How will I
be able to get work?
>>: Oh, you are the worker not the employer.
>> Emery Berger: Yeah, that’s right, that’s right. So I want to be
able to somehow get in past all these qualifications without having
actually been a good employee. So really what we are left with is
you actually have to be a good employee and then kind of go rogue at
some point. So you have to have done 1,000 tasks well and then
suddenly be like, “Now I am evil; the mask is off”. And then your
qualification rate will quickly drop and then game over.
All right.
So fine, yes?
>>: This seems to be a similar problem to what I saw. No matter what
the bad guys do they have to --. The [indiscernible].
>> Emery Berger: So I am not really familiar with this particular
thing. Perhaps you can give me some more details offline.
All right, cool. So now we are in the situation as I said there are no
evil geniuses, but we still have Homer and we have Bender. Homer is
the moron; Bender is the bot who is acting randomly. So now we are
still like, “What do we do?” We ask one person, that’s maybe not
enough. So what else could you do?
>>: Ask two.
>> Emery Berger: Ask two, right. So this is a pretty generic and
obvious thing to do, right, ask two people. Ask three, oh okay. So
how many people should we ask? Do I hear four, do I hear five? All
right, so you won.
Yeah?
>>: So how do the random ones get higher than just 95 percent approval
ratings? That seems --.
>> Emery Berger: Yeah, so that’s an interesting question. The fact
that approvals happen in a kind of sampling-based way means that it is
possible for you to get through. It’s an
excellent question. It’s basically based on the inability of the
employer to screen all of the answers, because at scale you just can’t
do it so they use sampling.
>>: So if you sample less than 5 percent your approval ratings
automatically go higher.
>> Emery Berger: Yeah, that’s right, certainly. So one thing that
people do is sort of --.
>>: Another thing that will happen is less than 95 percent.
>> Emery Berger: No, that’s not really what happens. But that being
said there are definitely random actors in the system, but here is the
problem, once you get caught then you are done. So you are still you,
you are still Manuel [indiscernible]. You can’t be like, “I can never
work again, because –“.
>>: But I don’t understand how I get caught. Even if I answer some of
the answers wrong if the controls are not checked my approval ratings
are still above 95 percent.
>> Emery Berger: So it depends on the frequency of checks, you are
right.
>>: Clearly you could put more scrutiny on the ones that you found in
the sample that were wrong. So it’s like a two tier thing.
>>: Right, but unless Amazon helps with that --.
>> Emery Berger: Yeah, yeah, that’s right. The issue of Amazon helping
has been one that many people broached. Like wouldn’t it be nice if
you allowed us to do a little bit more? And so they very grudgingly
incorporated new features into the system. So one thing they recently
rolled out is they have a sort of selected crowd that has been more
vetted and so those guys, you have to pay more to use that crowd and
then nominally there is a quality guarantee.
Now, I should add, the problem here is not just --. So there is this
aggregated problem. There is this whole worker pool and there are
people acting randomly. And maybe I will get lucky and I will get good
ones, but for my computation I actually want the correct answer. I
really do want to know if it’s a giraffe. I don’t want to be like,
“Um, maybe it’s a giraffe” and hopefully there will be good people who
answer. That’s where we are right now. So if you adopt this kind of
random adversary model and you look at this redundancy you see that
this is not enough; this doesn’t work.
So let me give you an example: Suppose these guys are flipping coins.
What are the odds that they will agree randomly? It’s 50 percent. So
for an isGiraffe question you could have flipped the coin yourself.
You didn’t have to hire two people. So you effectively wasted two
people’s worth of employment and you got nothing. So that sucks;
that’s not very useful. But it turns out
if you think about it this is this terrible probability of agreement
because we flipped two coins. But if we flip six coins the chances of
them agreeing at random suddenly are very, very low.
So if you view this through this prism of our adversary is random, what
can we do to minimize the chance that their agreement is due to just
random probability? Then we can adopt this kind of approach. So in
general if you have unanimous agreement it’s obvious what the formula
is. It’s K × (1/K)^N, where K is the number of choices and N is the
number of people. So the worst case is where there are only two
choices, and if it’s unanimous then you have a smaller number. It turns
out that this generalizes and you can say you have some number, even
less than a plurality, and you can still rule out the likelihood that
the agreement happened due to random chance.
I am not going to show you formulas here. This is actually unanimity
and the number of choices. And this is just to show you that when you
have a larger number of choices like six choices, if you ask two people
the chance of them agreeing at random is quite low.
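[Editor's note: the arithmetic behind these numbers is easy to check. A quick Python sketch of the formula just stated, where N workers each pick one of K choices uniformly at random; the function name is ours, not from the talk's slides.]

```python
def p_random_unanimous(k, n):
    """Probability that n uniformly random picks among k options all
    agree: k ways to choose the shared answer, each picked independently."""
    return k * (1.0 / k) ** n

print(p_random_unanimous(2, 2))  # 0.5     -- two coin flips: useless
print(p_random_unanimous(2, 6))  # 0.03125 -- six coin flips rarely all agree
print(p_random_unanimous(6, 2))  # ~0.167  -- six choices help, even with two people
```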
>>: That’s only if it’s random, right. What if people are just choosing
the first choice?
>> Emery Berger: Yeah, so that’s good. We will get there in two
slides, excellent observation.
All right. So we built this system that’s based on this principle of
assuming that we have random adversaries and using kind of consensus to
drive the correctness, to actually determine quality. It’s written on
top of Scala, which is actually both good and bad. It’s bad because
Scala syntax is sort of arbitrary. You can do all these amazing things
with it. It’s good because you can do all these amazing things with
it. It’s really a substrate for building domain specific languages.
One of the best things about it is that it interacts with Java,
which seems weird coming from me to say, “Like, best thing is it works
with Java”.
But it means there are tons and tons of code out there that you might
want to use for an app for example. And yet there is this human piece
that you would like to use. So what we did is we built this DSL that
allows you to write functions, just a function and it actually is
implemented by people. And then that function you can use anywhere in
your program like it was an ordinary digital function. So here are the
strange things about using AutoMan.
So first you have to pay people. So now you have to think about the
total amount of dollars you use for your computation, which you
normally don’t think about. The way Amazon works is it’s sort of like
old school video games where you put in quarters and you get credit.
So you can kind of refill your computation up to a certain point. In
fact if it runs out it throws an exception. It’s an out of money
exception.
[laughter]
>> Emery Berger: Yeah, so this is kind of a different model. That
said, it’s sort of like, so I don’t see, maybe Jim has experienced this
on super computers. So if you have ever submitted jobs to a super
computer system it’s exactly like this. You have an account associated
with it, and some amount of money is taken away. And I assume that app
hosting platforms like Azure and things like that have a similar sort
of model.
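[Editor's note: the prepaid balance model described here can be sketched as follows. The class and exception names are hypothetical, not AutoMan's actual API.]

```python
# A minimal sketch of the "quarters in a video game" budget model: the
# computation draws against a prepaid balance, and running out raises an
# exception.

class OutOfMoneyError(Exception):
    pass

class Budget:
    def __init__(self, dollars):
        self.balance = dollars

    def charge(self, cost):
        """Deduct cost from the balance, or fail if the money has run out."""
        if cost > self.balance:
            raise OutOfMoneyError(
                f"need ${cost:.2f}, have ${self.balance:.2f}")
        self.balance -= cost

budget = Budget(0.10)    # prepay ten cents
budget.charge(0.06)      # pay the first worker
try:
    budget.charge(0.06)  # only $0.04 left, so this raises
except OutOfMoneyError as e:
    print("out of money:", e)
```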
All right, so there’s that. The other is now your functions, whenever
you ask people, people are going to be imperfect. You are never going
to get 100 percent assurance that this is correct. So now we have a
confidence level associated with functions. So the default is 95
percent, the gold standard for social sciences. But anyway, you can
set this to be whatever you want. What this means is the probability
that the result that you obtained is due to random chance is less than
.05. And you can dial this up or down. So you can kind of trade off
quality and money directly.
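[Editor's note: the quality/money trade-off can be made concrete. Under the random-adversary model, "95 percent confidence" means the chance the agreed answer arose from random clicking must fall below 0.05, so you keep adding unanimous votes until K × (1/K)^N drops below that. A sketch; votes_needed is our name, not AutoMan's.]

```python
def votes_needed(k, confidence=0.95):
    """Smallest n such that k * (1/k)**n < 1 - confidence, i.e. the number
    of unanimous votes needed to rule out random agreement."""
    alpha = 1.0 - confidence
    n = 1
    while k * (1.0 / k) ** n >= alpha:
        n += 1
    return n

print(votes_needed(2))        # 6 -- a yes/no question needs six unanimous votes
print(votes_needed(6))        # 3 -- more answer choices means fewer votes
print(votes_needed(2, 0.99))  # 8 -- dialing confidence up costs more workers
```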
Okay.
So I mentioned there are these things that AutoMan has to do --.
>>: Is that trade off understandable?
>> Emery Berger: It’s understandable in one sense, but I think that
it’s understandable actually in two dimensions. So I don’t know for
sure how much money I will end up having to pay, but if I get a smaller
number of people and they reach consensus quickly then I can say, “Well
I have reached a lower threshold of confidence”. So the way that AutoMan
actually works --.
>>: But I can’t predict what it would cost for me to raise it to a 99
percent confidence.
>> Emery Berger: You can predict what it would be at minimum, but not
at maximum. That said, I didn’t mention this, but there is confidence
level overall, but you can also have a confidence level for the like
individual invocation of a function. So you can say, “All right, there
are a million animal classification tasks and I normally want to pay
them as little as possible, but I definitely want to pay no more than a
dollar for any of them”. So you can do that. So there is the overall
computation and then there is per function as well.
Manuel?
>>: Is there a minimization of this function?
>> Emery Berger: Yes, yeah. All the results go into what I think is a
MySQL database, so it’s stored.
>>: [inaudible].
>> Emery Berger: Yes, that’s right; they are pure functions, exactly.
>>: So if I understand this, the function is if you have this overall
confidence that means that some invocations of the function could have
very low confidence and other’s high confidence, is that right or is
this 95 percent per result?
>> Emery Berger: Yeah, okay, so let’s see. So if I have a function and
I establish a confidence level and the confidence level is actually
over every invocation of the function. That’s the way AutoMan works
right now. So I have got a function isGiraffe and every single result
has a certain P value associated with it. And then there is this
money thing that lets you control how much you will spend on any
invocation.
>>: Okay, I see.
It can be that there are sort of semi-ambiguous situations where there
is not very quick agreement, but there is a bias. And that means that
AutoMan will have to ask more people to sort of figure out with high
probability what the right answer is, and you can cap that. But yeah,
the way AutoMan works right now is you cannot say you want 95 percent
confidence in 1,000 tasks and then the individual ones, some of them
could be zero. We want to do that, but it’s not implemented yet.
>>: So what does it mean the 95 percent? It means that if they answer
at random I would accept the result with this probability?
>> Emery Berger: You would accept the result with less than 5 percent
chance. So it’s one minus that.
>>: So if this looks like a giraffe, but is not a giraffe? I mean if
there are some confusing, like maybe it looks like a lion, but it’s a
giraffe and there are these two --.
>> Emery Berger: That’s an awesome lion, but okay.
[laughter]
>>: So there are these two people who give me this consistent lion
answer.
>> Emery Berger: Ah, yeah, so this is a good question. So the question
that you are asking really is: Is what people say really true? All
right. So we adopt the Stephen Colbert version of reality, which we
call “truthiness”. So if everybody agrees it is true then it’s true.
There is really nothing else we can do. So my example of this is, so
you know the expensive car that has this logo, it looks like this,
comes from Germany, the brand is Mercedes Benz, who is the
manufacturer? So I kind of tricked you here already, or I led on that
this is a trick question. The manufacturer is not Mercedes Benz,
because Mercedes Benz is not a company. But if you ask people at random
they will almost certainly tell you that it’s manufactured by Mercedes
Benz, but it’s actually Daimler.
>>: Right, but if like a third of the world would say that it’s a lion
and two-thirds would say it’s a giraffe, what’s the confidence level
it’s a giraffe?
>> Emery Berger: So if you assume that the sampling is uniform and
random then you will --.
>>: It’s easy.
>> Emery Berger: Yeah, then it’s easy, that’s right.
>>: So that’s how it’s done.
>> Emery Berger: Yeah, yeah.
Okay. All right. So I mentioned in addition to correctness, which I
have been focusing on, there are these other issues. These issues are
actually fairly easily dispensed with, somewhat surprisingly. So
first, what we do: it has not yet happened, but it's almost certain to
happen that when you contract people out through Amazon you are going
to be considered an employer, and at some point minimum wage laws are
going to kick in. So just to stay on the right side of the law the
default, which you can change, is that AutoMan pays US minimum wage.
All right. So that’s how much we pay, the question is still how much
time to allot to tasks and do we pay more?
Manuel?
>>: I mean the task shows 10 minutes, 6 cents; that can add up.
>> Emery Berger: No, it does, that wasn't mine. We didn't post that,
okay.
So what we would do, and this is the way it works in AutoMan: every
task starts out at 30 seconds, and time is money. So we pay a wage,
and the wage right now is minimum wage. So it would be 6 cents for a
30-second task, which is the currently derisory US minimum wage. This
is the actual Federal minimum wage: $7.25 an hour divided by 120
thirty-second slots.
>>: It’s 10.00 in Washington.
>> Emery Berger: Yeah, states can have it be higher, but the Federal is
the minimum.
>>: [inaudible].
>> Emery Berger: You know, that’s --. It’s like you know how it’s your
responsibility to pay state taxes for stuff you buy online, right.
>>: [inaudible].
[laughter]
>>: [indiscernible].
>> Emery Berger: Well, so there is --. Well, only because of the state
tax, is that what you are referring to? I mean obviously you can pay
more. You can pay more or you can pay less; this is just a parameter
to the system. So you can actually say wage equals something.
>>: I think more seriously you can sort of create pressure, right. So
you can basically do [inaudible], right. You can have 20 second tasks,
25 second tasks, etc.
>> Emery Berger: That’s right, so what happens actually, so this is in
fact --. Oh, nice, that’s weird, stupid Mac. Yeah, blame the Microsoft
people who work on Mac products.
All right. So anyway, here is the way a task starts off, and what
could happen? Nobody could show up to do the task, or not enough
people show up to do the task. What does that mean? It could mean two
things. It could mean, one, the task just has too short a time
allotment: people tried to do the task, but it took longer than 30
seconds. The other is people looked at the amount of pay and said,
“Screw this”, right. So there is no real way for AutoMan to know, so
what it does is it actually doubles both. That is to say, it just
doubles the amount of time, and the wage rate stays the same.
So I argue that this actually prevents gaming even if you know that
AutoMan is using this strategy, which may not be obvious so I will walk
through an intuitive example.
>>: I am not sure what [inaudible] behind that was. Much of the time
they don’t bite because the task description is boring.
>> Emery Berger: Oh, yeah.
>>: And then what happens is that you double the time and eventually it
looks like it is way too lengthy and boring.
>> Emery Berger: So let me tell you there is some weird phenomenon. So
first I believe that if you pay people enough they will do almost
anything. So I think that a task that’s really, really boring if you
offer them a buck to do something that’s really simple and is going to
take them 10 seconds they will probably do it. So as the pay goes up
people definitely are more attracted. The other thing is that this has
this weird side effect on Amazon’s Mechanical Turk because Mechanical
Turk just has a strange user interface where freshly posted tasks are
up at the top. So if you re-post a task at a higher rate not only is
it worth more it also appears at the top. So it has a kind of nice --.
That’s not something that we rely on, but it’s a happy circumstance.
All right, so here’s the gaming situation. So suppose you are like,
“Hey, AutoMan doubles things, I am going to become rich at Emery’s
expense”. So how am I going to do this? So I am going to see money
coming in for some reason denominated in euros. So I see this money
that comes and I think, “Well it’s 5 euros, but I will just wait” and
the task goes away, gets re-posted and now it’s 10 euros. So it’s
just a waiting game, right. I just need to wait until it becomes
larger, and larger, and larger and at some point it becomes a giant
stack of money. And then I jump in and grab the task.
So what’s the problem with this strategy? It’s people, stupid other
people. There is always this problem that if the marketplace doesn’t
have just one person waiting there might be somebody else and that
person has some probability of seeing that 20 euro bill and taking it
away and making the dreams of the giant stack of money vanish. So what
really happens is that it’s complicated; it’s not just I need to wait a
certain amount of time. I need to take into account the likelihood
that somebody is going to snipe the task from me. They are going to
take it away and do it.
So this is just a little, very, very simple bit of math. What is your
expected gain? Your expected gain is the base rate of pay, and then
every round it doubles. So the multiplier is two, and after round one,
round two and round three it gets two times, four times, eight times
larger. But there is also the probability that the task is still
available after all these rounds. We model that as a fixed value per
round; you could be more complicated, but this is sufficient. So let’s
say that P-available is a half: I have 50/50 odds that every time a
task is posted somebody else will do it before I get to do it. Then
the multiplier is two, and through the power of math those cancel. So
your expected gain is fixed, which means there is actually no
incentive to ever wait.
>>: Well, but you took that out of your half right?
>> Emery Berger: Well half turns out to be the maximum likelihood
estimator for this. So actually that’s what you would do if you have
no information.
>>: It’s probably even lower.
>> Emery Berger: Yeah, so if the likelihood that it’s available, if
it’s one there is nobody else in the system, right. The likelihood
that it’s available in fact is likely to drop as the value of the task
goes up, which means that this would even be a negative incentive.
Okay. So good, we have got the money and the time thing taken care of.
It goes, it asks people and it obtains confidence. If it doesn’t get
the confidence on the first round it goes and asks more people and so
on until it achieves the desired confidence and then returns a value.
So everybody should be on board with this. Let me just observe that
this random adversary thing has some weird effects. One of them is
that you no longer even need a majority: a majority is neither
necessary nor sufficient to achieve your desired consensus, right.
So, if I have a bunch of coins, or let’s say they are dice, so they
are six-sided, the chance of you getting random agreement of a large
enough subset actually gets to be pretty small pretty fast. This is
the number of people involved, this is the number of choices, so here
is a die, and as you add more and more people the number that you need
to reach 95 percent confidence, which is this axis here, gets lower
and lower and eventually drops below the 50 percent line. So it’s just
an interesting --. You can’t think of this really usefully in terms of
how many people, majority and all of this stuff; really the random
adversary model is crucial.
Okay. So this goes to [indiscernible] question. So here is an example
kind of task that you can do. So this is what we call a RadioButton
task. You can choose one of K answers as the actual correct answer.
So how many giraffes are in this picture: none, more than one, or one?
Okay, so the K is just three here. So what’s wrong with this approach?
So put yourself, if you will, in Homer and Bender’s shoes. What would
you do if you were a bot? Which one would you click, maybe the first
one?
So there is a problem here that Homer is lazy. So random is actually
good from our perspective. What’s bad is suppose everybody clicks the
first one. And then they will all agree and we are like, “Oh, crap”
because from AutoMan’s perspective it’s like, “Yay, instant agreement,
the answer is none, no giraffes”. So that’s not good. So we solve
this trivially by randomizing all the answers. So anybody who sees any
posting sees a differently shuffled version of the answers. So if they
behave randomly, or if they behave in a fixed way I should say, with a
bias that’s not actually biased towards the correct answer, then they
will look like noise, which is good for us.
Todd?
>>: So did you see this in practice? I mean this clearly is a nice
feature, but you must have seen this in practice if you actually added
it to the system.
>> Emery Berger: Well, we actually built the system in a very
principled way to start out with and then we deployed the system. Then
the question really is did we ever observe people who were acting
randomly, and I think we did.
>>: Did you see it any other way though? In the sense that people
click the first button more often?
>> Emery Berger: Yeah, so people --. I don’t remember if we actually
went through and looked at that, but we definitely got people
answering things in a way that made us highly suspicious that they
were doing nothing but just clicking something.
>>: Well clicking something is not going to [inaudible].
>> Emery Berger: So if everybody clicks something randomly then we will
never, probably, unless we ask a ton of people get to any high level of
confidence. But again, if everybody in the marketplace is just acting
randomly then the marketplace has no value.
>>: So does Amazon know whether there are bots and do they do anything
about it?
>> Emery Berger: So certainly they are aware because people have
pointed it out to them. I believe that the bots that are used are
mostly for tasks that are highly repetitive and difficult to verify.
So that is for example --.
>>: So the [inaudible] choose who will deploy the robots.
>> Emery Berger: Yeah, that’s right. So what they do is they look for
a task where there are like thousands, and thousands, and thousands of
them posted. So the odds of them being checked are low; scrutiny is
unlikely. Then they pick ones that don’t have fill-in-the-blank or any
other sort of structure, so that they can deploy the thing randomly.
>>: What do you mean they see thousands of tasks? Aren’t all the tasks
individual?
>> Emery Berger: No, so what they do, I think I showed this before,
but maybe not. So it says 6 cents per HIT and then there are some
number of HITs that are available. So what workers really like to do,
and real human workers really like to do this, they like to sort of
learn how to do a task and then do that task over and over again. They
love that; I mean it’s something that everybody has noted and it
really is true. They kind of get trained and they amortize the
training over doing many of them.
>>: So randomizing the order doesn’t seem to help in a case like that.
If all of them have none as one of the options, it would be two lines
of code to have it always pick the same answer.
>> Emery Berger: Yeah, right. So the question is, if I wrote a bot
that actually looked at the text in the answers and then picked among
the answers with some bias, is that doable? It’s certainly doable,
certainly if they knew there was some sort of structure here. A:
nobody does it, luckily. B: there is no good general way to solve
this. The only way to solve it --.
>>: Multiple trace captures.
>> Emery Berger: Yeah, so, I am not sure about multiple trace
captures, but you can buy people, you can buy bulk-rate CAPTCHA
solvers at ridiculously cheap rates. So there are people in Vietnam
who will solve 10,000 CAPTCHAs for you for like 10 dollars. So they
are just like, bing, bing, bing, bing, typing things in; it’s crazy.
>>: So I am curious, suppose that my task here is Where’s Waldo? Or a
variant of Where’s Waldo where there are multiple Waldos, and maybe
there are 10 in the actual picture and they are progressively harder
to find. So when you ask them how many Waldos are in this picture
someone is very easily going to say one and maybe there are 10. It’s
going to take them a lot of effort to see 10.
>> Emery Berger: So again, I think this comes down to truthiness right.
>>: Correct, and I totally understand that, but my question is this
though: What’s your programming model API? Suppose I actually care,
not whether there are less than 10, but whether there is a 2 percent
chance that there are greater than 7; that means something to me.
Maybe I want to give a million dollars to somebody. So it’s a
different flip then on the multinomial that you have been talking
about, right.
>> Emery Berger: It is, yeah, that’s intriguing. What we have been
assuming is that you can ask anybody. So you pick a pool of people who
are either equally qualified or who are self-selecting to answer
questions. You don’t expect on Mechanical Turk to be able to find
Scala programmers, or not many. But you expect to find people who can
recognize giraffes, and the people who don’t know what giraffes are
aren’t going to accept the HIT. Their skills, we assume, are more or
less uniform.
So if you say, “Well here’s a question” what’s going to happen is that
if everybody --. If the question has this huge drop off, like you can
find the first one super easily, but then you have to be unbelievably
anal to find the rest then what’s going to happen is there is going to
be this huge bias in the distribution of answers and you are going to
end up hitting that. If however the curve looks a little different,
it’s a little smoother of a decline then you are going to end up with a
bias towards one thing, but it’s going to keep asking more people. And
it all depends on where the majority lies. But it really is about
consensus right now; it’s not looking for an outlier.
Now we have subsequent work that we are working on right now which is
actually about obtaining distributions. So, where there is no one
correct answer necessarily; there may be a distribution we care about
and that’s something we are working on right now.
All right. So let me continue just talking about more we can do and I
will actually show some code in a minute, but the API is very simple.
So this is a question, in fact a question to which I believe we had a
bogus responder. We actually ran this; it seems like a totally goofy
thing to ask, like which ones are from Sesame Street. What we actually
did was we said, “Which of these do or do not belong?” And it was a
way of doing kind of clustering with people being the metric. But this
is what we call a check box question.
So here every one of them could be on or off. So instead of it being
five choices it’s actually two to the five choices. So the chance of
them agreeing randomly is very, very low except that if you present it
like this people will think, “Great” and they will just agree. Or if
you present it like this then great they will just agree so we randomly
permute them. We turn off and on all these check boxes at random and
so if people just say, “Yeah, that’s it” then it goes and it just looks
like noise.
>>: Isn’t that weird to have the answers filled in?
>> Emery Berger: Is it weird? Yeah, there is some weirdness. One of
the other weird things is, we haven’t implemented this, but we also
have a way of doing drop-downs. And with the drop-down, if you think
about it, what’s the easy, lazy thing to do? Click the first thing,
right, so that’s a problem. So what we do, and this is not deployed
but we have done it, is that you could think, “Oh, well I just
randomly shuffle them”, but nobody would want to use that. Like here
are all 50 states or here are the 200 countries, find your country.
Nobody wants to do that, right. So what we do instead is we just
imagine the thing as a ring and we randomly pick a starting point and
then it wraps around. So that avoids that bias.
>>: So does Mechanical Turk have native support for this?
>> Emery Berger: It does not; it has no native support for anything
practically.
>>: So have you thought about proposing it to them?
>> Emery Berger: So I have reached out to them. I actually should
probably try to do that while I am here to be honest.
>>: I mean yeah, you are pretty close and realistically it seems like --.
>> Emery Berger: I know somebody who works there and I know that the
Mechanical Turk group is very small, but that Jeff Bezos really likes
them. Like he loves Mechanical Turk, this is one of his pet projects.
>>: [inaudible].
[laughter]
>> Emery Berger: Right, it might be an improvement, yes.
>>: But you need semi-total evidence about the importance of it being
the first choice on the list, ask any politician. They fight to get
that spot under the left column.
>> Emery Berger: Sure, yeah, no, absolutely, that’s certainly true and
unfortunately the randomization is one time for all the ballots that
everybody sees.
>>: [inaudible].
>> Emery Berger: Really, because it’s electronic voting.
>>: Do you know how Schwarzenegger was elected?
>> Emery Berger: Because he is like big and strong and talks like this?
>>: They randomized the alphabet and rotated it.
>> Emery Berger: Yeah and so you think that’s how he got elected?
>>: No, but.
>> Emery Berger: I and Arnold would disagree. You should be more
worried about the latter, I guess.
Anyway, so what else can we do? So we have this RadioButton. We have
check box questions and we also have constrained free text. So here
this is, “What does this license plate say?” And you have a way of
actually answering a question like this; you can’t just type in any
arbitrary thing. Here it’s sort of a bounded-length regex, but it
turns out people are not that good at regex in practice, so we make it
simpler.
We use something that’s actually adapted from FORTRAN, from FORTRAN,
no, why did I say FORTRAN? It’s from COBOL, from COBOL. So these are
picture clauses from COBOL, so I got to cite COBOL, that was awesome.
So each letter, well, the number of letters is the total number of
characters, and each letter has some meaning. So X means alphanumeric,
and a 9 means numeric only, etc. So there are a few choices like this.
Okay. So clearly the range of possibilities is now very high; it’s
staggeringly high. It’s 36 to the sixth if they were choosing at
random. So really you just need two people to agree and you are done.
>>: Sorry, what do you present to the user in this case?
>> Emery Berger: They get a form, they fill it in, and it has got
JavaScript in it and so.
>>: No, I mean in terms of --.
>> Emery Berger: Oh, so there’s the programmer and then there’s the
worker. The worker sees a blank and it says how many characters it is
and then it’s constrained just like when you fill out a normal form
anywhere.
>>: So what’s to stop them from typing A,A,A,A,A?
>> Emery Berger: Oh, nothing, that’s right.
>>: So that’s the easy answer.
So you can fill it in randomly as well.
>> Emery Berger: So we have talked --. So there are a bunch of UI
issues here. So one of the things that we talked about doing is, you
can imagine it being like one of those combination locks where you
turn each individual number. So you could totally do that, and you
could basically have some kind of a drop-down with the alphabet, but
it would be horrible. Nobody would want to do it. People just don’t
want to do it.
>>: I think your answer here is a single letter, right.
>> Emery Berger: It’s a single letter. So everybody has to agree with
the single letter. So I agree --.
>>: A.
>> Emery Berger: A, everybody says A; A is the answer. So I think the
problem here basically is the weakness of the random assumption when
it comes to certain kinds of human behavior. So the question is, is
everybody really going to type A, A, A, A, A? It’s probably more
biased than typing A, Q, Z, 1, 4, 3, right, but we don’t know what
that bias is.
>>: It could be any random starting letter.
>> Emery Berger: It could be any random starting letter. And I will
say that with this particular task, and because of the long-term
financial incentives, this is the kind of thing that nobody writes a
bot to do. This is exactly the kind of task where they are like, “No
way, the bot has to interact with JavaScript, no”. They just look for
very simple forms and they fill in things at random.
Yeah?
>>: If you want to find out what people type randomly ask people
[inaudible]. They figure out what a German cryptographer doesn’t.
>> Emery Berger: When they actually type, yeah. So we have talked
about doing stuff like actually --. I mean it’s not research, but it
would be easy, you could just measure the bias of how people type
things. This is not a problem and in fact there is subsequent work
that has nothing to do with this which is about detecting errors in
data. We have this project that we call data debugging and of course
people make typos in very particular ways, but we kind of don’t care.
We can find them without knowing that sort of bias.
So I am going to show you a run. So this is a real execution that
actually fortuitously exposed all kinds of things that AutoMan does to
maintain quality, for pricing and all this stuff. So this is this task
where we wanted to automatically cluster children’s characters,
children’s show characters. So we took a whole bunch of things, we
said, “Which ones don’t belong?” And then you could get like these are
in one cluster and these are in another cluster. This is hard from a
computer vision perspective because it depends on some other
relationship. It’s not about the way the thing looks, it’s related to
the program that they were all on. I mean there might be some
relationship, but.
Okay, so here’s one that popped up.
Which one doesn’t belong?
>>: Sponge Bob.
>> Emery Berger: Sponge Bob, thank you, somebody watches TV. Okay,
everybody else is like, “Oh, I don’t watch TV; I am too educated and
cultured”. But, I have kids and I definitely know what these are.
All right. So what happens? So in this particular case AutoMan has
this strategy that by default is trying to optimize for money. So it
tries to spend as little as possible and it kind of crosses its fingers
and hopes for unanimity on the first round. Okay. So it turns out you
need three tasks to get unanimity here, to get the 95 percent
threshold. So it goes and spawns them and within two minutes somebody
said, “Sponge Bob”.
And what’s cool about AutoMan it exposes a socket that you can listen
to and you can actually see all the activity. So when you write a
program it’s waiting for the function to return, you map this function
onto a bunch of images and it goes. Then you open up the socket and
you can watch this. It’s like worker twelve has arrived, worker twelve
says the response is this and worker twelve’s task was approved and
paid this much, it’s pretty funny.
So then we waited, and just a few seconds later we get another Sponge
Bob, and then somebody said, “Kermit” and we were like, “Really,
Kermit, you have got to be kidding me”. And so one of my friends who
is Greek said, “Well”, because Greeks are really good at this. He
said, “Well, Kermit is the only one who has been on two different
shows”. And I was like, “Okay, you win, but I am pretty sure that’s
not the line of reasoning that went into this. I believe this person
just clicked”. Okay, he was like click, done.
So AutoMan is like, “Well, that’s not enough”. So AutoMan then goes
and spawns three more tasks. The reason it has to spawn three is now
it needs five out of six to get to this 95 percent confidence level.
So somebody comes in a few minutes later with Sponge Bob, but it’s not
enough because we have spawned all these other tasks; we need the
other two to come back. We get this weird delay. There are some
time-of-day issues with Mechanical Turk: when it’s dark over the
American continent and India, or sort of at the beginning of the day
and the end of the day, there are not many people in Samoa working on
Mechanical Turk, unfortunately. So it timed out, and if it times out
what happens? AutoMan doesn’t know if it timed out because people are
sleeping, we could put that in but we don’t, or if people just were
like, “This is not enough money”. So it goes and it doubles the amount
of money and time, and then somebody rather quickly showed up with
Sponge Bob and that’s it, we are done. So it returned Sponge Bob for
this one and it cost 36 cents and it took a while. But I think it took
a while for weird reasons, which is something else we will talk about
in a minute.
Yes?
>>: So I think this is unsatisfying on a number of levels. One of them
is that you have latencies that are very high for tasks that take you I
guess about three seconds.
>> Emery Berger: Yes, yes.
>>: Another one is that there is a degree of unpredictability or
[indiscernible], as well as the cost you end up paying, as well as the
amount of soul searching you have to go through if you are actually
looking at that. It’s a bit too much I would say.
>> Emery Berger: So let me respond to that in two ways. So one, the 95
--. Actually, do I have this room until 12 or is it actually ending?
>> Ben Zorn: Yeah, until 12.
>> Emery Berger: Because [indiscernible] is running away.
All right, so there are two things here. First, 95 percent is a
pretty high bar, and it may be too high a bar to impose for every
single question. So this idea of weakening it, so it’s 95 percent or
90 percent, whatever you want, for a collection rather than each
individual circumstance, could allow you to drop stragglers. So right
now we are stuck; every one of them is a barrier in a sense and we
don’t want
that. But what we could do is we could basically adopt the sort of
thing MapReduce does: these things took a long time to come back, so
we cut them off. And I believe we can reduce latency in that way, but
there is definitely a latency/money trade-off, which we will talk
about in a minute. So we have a way of dealing with that.
All right. So let me move on and talk about another example. So we
wrote this license plate reader application and you might think, “All
right, this is a solved problem”. Things are better than they used to
be. People have been working on this problem for three decades. There
is an unbelievable amount of literature on what’s called ANPR,
Automatic Number Plate Recognition.
A few years back Maryland actually deployed one of these systems
because they wanted to do a traffic flow analysis. So they had this
very cute idea, put cameras on the highway on ramps and off ramps, see
which license plate came on and see where it went. Great idea right,
because this is a huge problem for civil engineers, they don’t actually
know where people go.
So they send out surveys and they ask people, “Hey, where do you go?”
And nobody fills out the surveys, or the surveys are unreliable, and
they say, “Please keep a diary” and it doesn’t really work. So they
were like, “We will solve this problem”. So they got some company that
promised them something like 98 percent accuracy for reading these
license plates, but this is what they got. So they ran their own
vehicles through the system and in only twelve percent of the cases
were the license plates correctly identified, which is pretty low.
So what happened? You might think this is just a solved problem. So
first I should observe that this is people going at a pretty good
clip; they are going fast. And these are cameras that are mounted up
high, so it’s not like, if you have driven on the east coast, the
E-ZPass or Fast Lane tolls where the camera is here, there is a big
light and you have to go slow, all of these sorts of situations. There
are many other factors; it turns out this is actually a much harder
problem than you think. So you all can probably see that plate. This
is a non-foggy day in San Francisco, well illuminated, and it works
great. This is a screen shot from the I-90 traffic cam in Boston. So
this license plate is a little hard to read. So there is this problem
of weather, visibility and there are all these other issues.
So Massachusetts, compared to many states, has relatively few plates,
but it has quite a number with a lot of noise. And this is the
California situation; every state has tons. There are tons and tons,
different fonts, different visibilities, so it’s actually fairly
complicated. So given that this is such a well-studied problem and
computer scientists are involved, you will be unsurprised to learn
that there is a benchmark suite. So these are actually drawn from a
benchmark suite of license plates. These are the most difficult
license plates for automatic systems to read.
>>: [inaudible].
>> Emery Berger: Yeah, but it’s not bad. I mean some of them, if you
look at this you are like, “That’s pretty straightforward, but this is
a little weird. Is it an M or is it a W?” You can see some of the
problems, but that’s a CAPTCHA. So that’s a screen shot of a CAPTCHA.
What the hell does this say? So it’s actually maybe even easier than a
CAPTCHA, but it’s getting close. It’s around the same complexity. So
it’s hard for computers to solve this problem.
All right. So my student Dan Barowy, who was here last year as an
intern, is the lead student on AutoMan, and this is a task that he
posted, and it even says his name, which I told him was a very bad
idea, but anyway. So it popped up and he got angry e-mails every now
and then. So this goes back to somebody’s question about people
complaining about not getting paid. So the amount of money that comes
up here, see, there’s 6 cents per HIT and it’s 30 seconds. There is
your clue that it’s AutoMan; that’s pretty much a guarantee. And there
were some number of HITs available, it put these things up and you had
to go ahead and type in the answer.
So when he first did this we actually weren’t aware of the benchmark
suite. So we had to populate it with images of cars. And I said,
“Dan, go find images of cars”. And Dan found this website called
OldParkedCars.com, which is the internet, and you are like, all right.
So there is a website that’s nothing but photographs, you can see, of
cars: like, this is an interesting car I found parked somewhere. So he
actually combined those images with Google Image Search to get more
images of cars. Then they had to be scaled to the right size and he
de-speckled them and did whatever. So this is all interaction with
Java stuff; because Scala is on the JVM it interacts directly with it.
So it was really easy. The images are posted on Amazon S3, there is a
whole ton of them, and it’s basically two maps. So he mapped a
function, and one of the functions is: Is this a car? You might think
that’s weird, but if you search for cars and you take the images from
old parked cars, some of them aren’t cars. Some of them are like, “My
name is Joe”, a selfie in front of a car. And then he takes pictures
of cars. There is a lot of noise. And then the next map, once the
first one whittles down the list, maps a function which is, “What does
the plate say?”, and it returns strings.
So what does this code look like? It’s pretty simple; this is actually
more than what you need. He added some stuff which is the default. So
here is the function Is_Car.
So Is_Car takes a URL as an input and it’s a RadioButton question,
which means there is just some fixed number of choices. Here we have a
budget for this question: we will never pay more than a buck for this
question. You don’t need to do this, but it’s not a bad idea. Here’s
the confidence, which is the default; by default it’s 95 percent.
Here’s the question: Is this a car? Here is an image that pops up, and
then this is just Scala to represent the string and the actual sort of
[indiscernible] value, if you will, that comes back. So these are the
return values. You either get a tick-yes or a tick-no, Scala symbols.
All right. That’s actually the whole function for is_car, and then
this is the license plate function: “What does this plate say?” Here
is the URL; it’s a “free text” question and it has this pattern down
here, where the Y’s mean optional alphanumeric characters, more or
less like COBOL picture clauses. So we ran this program and the first
thing it does is ask, “Is there a vehicle?” It spawned these tasks and
a number of workers came back. They said yes, they said yes; we got
unanimous agreement, good. Then it goes and asks, “What does the
license plate say?” It’s a little hard to read the results up here,
but this one, by random chance, is actually a Washington plate:
767JKF.
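Similarly, the license plate function described here would look something like the following pseudocode sketch, in the same DSL style; the field names, in particular pattern, are assumptions:

```
def plate_text(url: String) = a.FreeTextQuestion { q =>
  q.text      = "What does this license plate say?"
  q.image_url = url
  q.pattern   = "XYYYYYYY"  // Y = optional alphanumeric, like a COBOL picture clause
}
```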
>>: How about JFK.
>> Emery Berger: JFK is one of the responses.
>>: [inaudible].
>> Emery Berger: Yeah, it’s a typo, or maybe just a weird bias, like
somehow JFK. But this was enough: two people disagreed.
>>: I actually got out of a ticket one time because of that. My
license plate was FYL and the meter maid typed FLY.
>> Emery Berger: Ah, nice, that’s good. There’s a lesson there for us
all. We should all get personalized license plates.
[laughter]
>>: As close to a real word.
[laughter]
>> Emery Berger: Exactly, that’s good.
So they disagreed, it spawned more tasks, those initially timed out,
so it spawned them again for .12 cents and then it got this thing
back. There is actually an interesting feature that I wasn’t aware of
until Dan pointed it out to me: if nobody has taken your HITs you can
withdraw them. So AutoMan actually cancels these outstanding tasks.
Once somebody accepts a task they’ve got it, but if nobody has
accepted it yet, it can be cancelled. So I was like, “Hmm, I have a
really good idea of something we can do”, and I will get to that in a
second.
So here are some results. We ran this program on a suite of hard
images. It turns out they are all of Greek trucks, and Greek truck
plates are hard to read. It’s called the LPR database and we used the
extremely difficult data set. At a 95 percent confidence level we got
91.6 percent accuracy. Now how did we measure ground truth?
Shockingly, this benchmark suite does not include the ground truth.
[laughter]
>> Emery Berger: No kidding, it’s crazy. I guess nobody had any hope
of answering these anyway, so nobody has ever tried. So Dan and
Charlie, who has also been an intern here, sat down with the images,
magnified them, and had arguments over which ones were which. They
went through all 144 plates and some were straight out of a bad sci-fi
movie: zoom, zoom, enhance, enhance; is it a W or is it an M, is it a
V or is it a U? Some of them are really hard and people make mistakes
on them. This is a lot better than 12 percent, and writing this app
took us 15 minutes. I was like, “Dan, go write a license plate thing”
and it was done.
Okay. It costs .12 cents per plate, more or less. The latency was
actually pretty low: less than two minutes per image. It really does
depend on what kind of task you post and how many HITs there are;
there are a lot of things that are difficult to account for. All
right, but there are some things that we exploit, and I am going to
explain one of them, which reduces latency.
So here are a couple of things we are doing. One thing I have just
described is that we have just three types of questions: there is
RadioButton, there is CheckBox, and there is this restricted free
text. We would actually like to be able to handle arbitrary questions,
like “What’s the answer? Fill in the text”. There are problems here;
part of the issue is that you can get answers where there is no way
anybody is ever really going to agree.
So here is a canonical problem: translate this paragraph into another
language. The odds that people will agree are either vanishingly low,
so you will never get consensus, or super, super high because they all
use Google Translate. Both of those are problems, so what we are
trying to do is turn this into a problem where we throw the answers
back to the Mechanical Turk workers and have them rank them. Then we
adopt an approach that basically gives you confidence in the resulting
rankings.
Then there is this question of optimizing time and money. So right
now --. Yeah?

>>: How does it work if they all run to the same service?
>> Emery Berger: Oh yeah, so all of them will be canonical: if
everybody only uses Google Translate then there is only one answer and
we will be screwed. So hopefully there is some reasonable person in
the audience who is not using Google Translate. But if only one
response comes back then there is no ranking and you would have no
confidence; it’s sort of like asking one person. So yeah, you
canonicalize them all, take all the ones that are different, and ask
people, “Which one is better?”
>>: Going back to the license plates, could Maryland afford to pay .12
cents per scanned license plate?
>> Emery Berger: Yeah, that’s a good question. It turns out there is a
commercial system that some police forces use. They have a camera
mounted on the roof of the car; they just drive around and it reads
license plates. It takes pictures of the license plates in front of it
and sends them off to a giant server, just some cluster, which does
some sort of vision processing and returns results. It costs 100,000
dollars a year per deployment to subscribe to that service. So it’s
non-trivial. And at .12 cents a plate, by the way, most of that cost
actually came from ruling out whether they were cars or not. If you
know it’s a car you don’t have to ask anybody. So if you can say,
“Well, I have got a picture of a car in front of me and it’s not a
donkey or a dude standing in the road”, then you can get consensus
even cheaper.
Okay. So in AutoMan time is money, but as all of you may be aware,
other people’s time is not worth as much as your time. A worker’s
time, we have said, is 7.25 an hour. How many people here get paid
more than 7.25 an hour? Oh, I am never working at Microsoft again; one
person. So, most people make more money. Let’s assume that your time
is worth 200 an hour. This is my Mitt Romney slide. So do you really
care about being super, super judicious with the expenditure of money?
Maybe you care more about latency.
So here is this mechanism. You can say, “Look, 3 is the bare minimum”,
but what AutoMan lets you do is take the ratio between your time value
and the workers’ time value, and it will spawn tasks so that, in a
sense, the worst case is you doing it yourself. As long as it is
cheaper than you doing it yourself it’s a win, and you may want it to
be faster.

So you can set some time-value multiplier, and what it does is spawn
way more tasks than it needs; but because Amazon allows you to cancel
outstanding tasks, once you reach your threshold you can cancel them.
So the question is how much you want to pay. This is just a dial; it’s
a risk.
It’s a risk, so you can bound your risk. The default is the minimum
you pay, but if you want faster answers you take the risk of, in the
worst case, spending that multiplier more. In the best case you spend
exactly the same amount, but there is a bigger pool, there are more
HITs available, more people will sign up to do the task, and you will
get the answer faster.
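As a rough illustration of the tradeoff, here is an illustrative cost model; the names and numbers are assumptions for the sketch, not AutoMan's actual accounting:

```scala
// Illustrative model of the time-value multiplier (assumed names and
// numbers, not AutoMan's actual accounting).
val minTasks   = 3      // bare minimum needed for unanimous agreement
val reward     = 0.04   // dollars per task (hypothetical rate)
val multiplier = 4      // "my time is worth 4x a worker's time"

// Spawn more tasks than strictly needed to enlarge the worker pool.
val spawned = minTasks * multiplier

// Worst case: every extra task gets accepted before we can cancel it.
val worstCaseCost = spawned * reward

// Best case: consensus arrives first and the rest are cancelled in
// time, so we pay exactly the minimum.
val bestCaseCost = minTasks * reward
```

The multiplier bounds the risk: you never pay more than the worst case, and in the best case you pay exactly what you would have paid without speculating.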
And you can think of it this way too. The way it works right now, you
have this barrier. It says, “I create the bare minimum. If everybody
agrees we are good. If one person disagrees it’s round two.” So we can
--.
>>: I don’t understand your comment that you will have a “bigger pool”
because it seems to me that the latency is in the discovery of the task
not the actual completion of the task.
>> Emery Berger: Ah, so, maybe.
>>: But what are the minutes because you had minutes for those things
and those tasks take seconds.
>> Emery Berger: Yeah, the tasks actually take some number of seconds
that’s true.
>>: So unless you sort of had a lot people --.
>> Emery Berger: So think about it this way: there are two things you
need to think about. One is, suppose I just post three tasks, three
because three is the bare minimum. There is a risk that I won’t get
all three back correct; they won’t all agree. The way it works is that
those three get posted, then it comes back to the system and says, “I
waited for all three and now I am going to go and do more”. I was
super optimistic with respect to the likelihood of consensus.

What I can do is dial down my optimism by extending the number of
people I am willing to ask. In a sense I am taking these multi-phase
barriers, barrier one, barrier two, who knows how many barriers, and
doing more work up front in parallel, somewhat speculatively. It’s a
kind of speculative execution. I launch these tasks and whichever ones
come back faster, I just kill the other threads. So it’s a way of
reducing latency.
>>: Yeah, I was going to say it’s really related to the cost of the
time it takes to kill them, right, because the only risk is that in the
time between when your task is complete and the time that you kill all
the rest that people enter the system and complete them.
>> Emery Berger: Yeah, right.
>>: And all the tasks can get accepted and not completed.
>> Emery Berger: They could, that’s your risk.
>>: I think the real risk is the amount of divergence, right. You have
a race to kill these things and then more people jump on them. Suppose
you have 5,000 [indiscernible] and thousands of people can do this
thing. Then there will not be any consensus just because of the sheer
number of these people. And then [inaudible].
>> Emery Berger: Oh, well that’s not actually how it works so I don’t
really understand the confusion.
>>: Why won’t you have divergence?
>>: Yeah, it seems to me you have a problem. If you have three answers
that agree then you would say it’s correct. But if a fourth task was
accepted and that guy is going to disagree with you now you need five
out of six, which means you are going to have to go back.
>> Emery Berger: Sure, sure, sure, sure.
>>: So it seems like when do you stop?
>> Emery Berger: Yeah, but you are making an assumption. You are
making a conditional probability assumption that changes the odds. So
you are saying that if this already happened then there would be this
greater risk, but you need to look at it differently, right. The
question is: If I spawn three tasks what is the likelihood that those
three tasks will all agree? Or if I spawn four tasks what’s the
likelihood that three of them will agree?
>>: But I don’t think that’s important. I think what’s important is
the possibility of this kind of divergence, right.
>> Emery Berger: Where people, so the divergence --.
>>: You run the risk of spending way too much money just because of
this very attractive task.
>> Emery Berger: So this business of the multiplier, maybe this is part
of the confusion, this doesn’t affect the rate of pay for each
individual worker. So the rate of pay remains. They still get say
7.25 per hour. What I am doing is not paying them more; I am just
giving more people the opportunity to do the task.
>>: And the first three that --.
>> Emery Berger: And the first three that finish --.
>>: So if they came in and are good we are great.
>> Emery Berger: Yeah, that’s right.
>>: But I think that’s a bias towards bots and Homers, right, because
they clicked fast, whereas the guys who actually looked at the license
plate [inaudible]?
>> Emery Berger: Yeah, so that’s an interesting question. So you are
saying it’s biased because they were fast. You can do things, and
people have done these things, we don’t right now, where they measure
what is known as paradata. Paradata is where you actually track
response time, you track mouse movements, you track whether they click
or hover over things, all of these sorts of things. It turns out that
people in general don’t move the mouse in a perfect path straight to
the answer. But of course you can imagine gaming this; it becomes some
weird arms race. You could write a bot that appears to cogitate and is
hesitant, like, “Ah, how about this one?” Then you are back in CAPTCHA
land. So I think it’s a fool’s errand to try to distinguish people
from bots in any way other than by using this consensus-based
approach.
>>: [inaudible].
>>: So the [indiscernible] would say you can’t actually make the
assumptions you want to make if you don’t add this extra bias of
[indiscernible].
>> Emery Berger: Um, yeah, I am not sure. I mean, of course the bots
could be answering; they could be sniping all of your tasks. So I
don’t really see why it changes the game to have this approach. It
doesn’t seem any different to me than the first approach.
>>: [inaudible] that’s why.
>> Emery Berger: No, but if I have bots --.
>>: It’s just like when they do [inaudible].
>> Emery Berger: Okay, I see what you are saying, right.
>>: You are increasing your population, but you are keeping arbitrary
the [inaudible].
>> Emery Berger: Okay, now I see what you are saying. Right, okay,
yeah, yeah, yeah. So your point basically is: if I ask a thousand
people, I have basically changed the set, the pool of people I ask; is
that different than just asking three people?

It is different, it is different. I think this is an unfortunate and
difficult thing to deal with, because if you think about the way
statistics normally work, what you are supposed to do is set a sample
size in advance and query it. Of course, if we do that in this
situation, then in many cases we are going to end up not getting an
answer, or asking too many people. So it’s very difficult to do this
adaptively in a way --.
>>: There are very well studied ways to statistically --.
>> Emery Berger: That’s right, that’s right. There is something called
the Baum, Bom, I can’t even speak right now, the Bonferroni-Holm
correction, that’s what it is. What you do is basically adjust the
P-value threshold every single time you test. So we have definitely
thought about this. We have not done it yet, but it is an issue.
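For reference, the correction being described works roughly like this. This is a generic sketch of Holm's sequential procedure, not something AutoMan currently implements, per the talk:

```scala
// Holm-Bonferroni: sort p-values ascending and compare the i-th
// smallest against alpha / (m - i); stop rejecting at the first
// failure. Returns a reject/accept flag per hypothesis, in the
// original order.
def holm(pValues: List[Double], alpha: Double = 0.05): List[Boolean] = {
  val m = pValues.length
  var rejecting = true
  val flags = pValues.zipWithIndex.sortBy(_._1).zipWithIndex.map {
    case ((p, origIdx), rank) =>
      rejecting = rejecting && p <= alpha / (m - rank)
      (origIdx, rejecting)
  }
  flags.sortBy(_._1).map(_._2) // restore the original order
}
```

With three p-values 0.01, 0.04, 0.03 at alpha 0.05, only the first is rejected: 0.01 passes 0.05/3, but 0.03 fails 0.05/2, which also blocks 0.04.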
>>: [inaudible].
>> Emery Berger: That’s right, yes.
>>: You want to reach 95 percent accuracy; you go one by one and you
stop whenever you reach it. There is some rule that [inaudible].
>> Emery Berger: There is, yes and that’s sort of how it works. The
way it works, it is a little more subtle, but it’s kind of like that.
>>: But if you do that I don’t see the difference between doing a
million and just three.
>> Emery Berger: Sure, sure.
>>: Sorry, but I don’t see how that addresses my concern that if the
first three agree and then the fourth one comes in [inaudible].
>>: But why would they? If you have four, why would the first three
agree?
>>: No, no, but that’s [inaudible] and if you cut off there you are
going to say, “I am done, I have my confidence”, but if you actually
considered the fourth that happens to come in then you are now below
and you have to wait for two more, right?
>>: But that’s a specific event that may happen. And you have the same
probability of working with [inaudible].
>>: No, no, but I mean [inaudible].
>>: [inaudible].
>> Emery Berger: So it comes down to this correction; basically it’s a
question of sequential testing. The whole point of doing the
correction is that, in a sense, the rejection criterion changes: as
you ask more people you have to reject more and more of the answers to
get the same level of confidence. So this is something that we should
be doing but are not yet doing; it is going to go in. There is
something that is actually a more serious problem which I want to
solve in the same framework, and I think this is going to get to your
question.
So suppose you have dependent tasks. I keep talking about these things
like they are just one function: I want to know this one function. So
I have got some function F; F is a human-implemented function with a
95 percent confidence level. Let’s assume we get 95 percent
confidence. Now suppose I have a function G and it also has 95 percent
confidence. Awesome, until I do this: what if I compose them? I say F
of G of X. Now I don’t have 95 percent confidence; now I have about 90
percent. So what do I do? I need to take into account the actual flow
of the computation and the amount of uncertainty, okay.
And we want to do it in a way --. So how can you solve this problem?
A priori, what you could do is crank these things up really, really
high so that you always have 95 percent out at the end. So you would
be like 99 and 98, or 98 and 99; which one should you do, flip a coin?
No, that’s not really smart.

What we want to do is do this dynamically. If we discover that F is
relatively cheap and fast and G is relatively expensive and slow, what
we can do to boost confidence is spend more on F and less on G. So now
when we multiply .99 by .96 we get .95. We can do this dynamically and
optimize on the fly to get the most bang for our buck. So that’s a
coming attraction, and it’s going to fit in with the correction.
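The arithmetic in this passage is easy to check; the numbers are the ones from the talk:

```scala
// Composing two 95%-confidence human functions drops the end-to-end
// confidence below the target:
val naive = 0.95 * 0.95        // about 0.9025, under the 95% goal

// Rebalancing the budget, spending more on the cheap, fast function F
// and less on the expensive, slow one G, restores the target:
val rebalanced = 0.99 * 0.96   // about 0.9504, back above 0.95
```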
Yeah, so good, perfect, right at the end. So in sum, we built this
system for programming with people. Okay, the mystery door that keeps
opening. It’s fairly easy to program. Once you write these things they
behave like ordinary functions in your programming language and you
can do what you want with them. Nobody actually asked this question,
but they are of course implemented as futures, so it’s not like
everything blocks when you invoke a function. They are really designed
for you to run map over a bunch of stuff; everything gets spawned in
the background and it only blocks when you actually need a result. It
takes advantage of the amazing embarrassing parallelism of people,
which is great.
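A minimal sketch of the futures-based style described here, using plain Scala Futures as a stand-in for AutoMan questions; the isCar body is a placeholder, not a real crowdsourced call:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for an AutoMan question: it returns a Future immediately,
// so invoking it never blocks the caller.
def isCar(url: String): Future[Boolean] = Future {
  url.endsWith(".jpg") // placeholder for posting tasks and reaching consensus
}

val urls = List("a.jpg", "b.png", "c.jpg")

// map spawns all the "tasks" in the background, in parallel...
val pending = urls.map(isCar)

// ...and we only block when we actually need the answers.
val results = Await.result(Future.sequence(pending), 10.seconds)
```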
So we built this thing, it solves all these problems, especially the
Homer and Bender problem, and you can download it. It’s on [inaudible]
and it’s on this web page, automan-lang.org. I know it’s in Scala, and
one of the next things we are going to do is put Java wrappers around
it. We are thinking about Python. I am not sure we are going to handle
C++, although Dan is really a big fan of F# right now, so you never
know.
Anyway, all right, thanks for your attention.
[clapping]
>>: Very good talk.
>> Emery Berger: Oh, thanks.