
>> Ece Kamar: Hi, everyone. It's my pleasure to introduce Walter Lasecki. Walter is a
PhD student at University of Rochester. He is working with Jeff Bigham and James
Allen. He is a second-time intern at MSR. He is also a Microsoft Research fellow. Walter
is mainly working on interactive systems that are powered by the crowd and machine
intelligence, and he is going to tell us today about those systems and the work he has
been doing with us here at MSR.
>> Walter Lasecki: Awesome. Thanks, Ece. So yes, today I'm going to be talking about how
you can use crowdsourcing to actually enable robust, interactive, intelligent systems that
are versatile enough to be put in the wild with real users and see what happens. So this
is work in collaboration with both Jeff Bigham and James Allen. I'm going to start off
today with a little bit of a background on human computation; then, I'm going to go
through some of the existing work in crowdsourcing, something that's kind of laid the
foundations for what we're doing these days; and then, I’m going to look at how we can
use this crowd agents model that we propose to enable truly deployable intelligent
systems; and then, a little bit about future work on that front.
So human computation is in many ways a field revisited. In his book "When Computers
Were Human," David Alan Grier kind of outlines the history of computers not as
machines necessarily but kind of the role of a computer or the work of a computer back
when it was someone who did computations. And it's a very interesting history because
of course this role goes way beyond machines. So eventually people were replaced by
machines because for raw calculations for simple math, machines just really can't be
beat by people. But there are still things that humans can do that computers can't yet. So
we're seeing a resurgence in using people in a computational process.
Now more recently you get a lot of human computation kind of in the form of
crowdsourcing, but even crowdsourcing isn't new. Distributing to a large group of people
is what the Works Progress Administration did in the 1930s to compute large scientific
tables. And they had workflow processes that were focused around distributing the
types of tasks that each worker was doing in a way that prevented similar errors so that
you could recombine everybody's answer and get a more reliable result.
So some of the differences between that and what you hear these days is, well,
crowdsourcing is now an open call. It might be on the web. It might be to your user base.
It might be to just anyone who wants to come along and help. But in this case we're
often warned about these malicious users especially in kind of semi-anonymous online
work; you don't have any guarantees that this person on the other end will actually do
your task, will do it correctly or has any expertise to do it. So you get quotes like this: so
workers "are inherently lazy and will avoid work if they can. Workers need to be closely
supervised and a comprehensive system of controls developed. A hierarchical structure
is needed with a narrow span of control at each and every level." Workers "will show
little ambition without an enticing incentive program and will avoid responsibility
whenever they can."
This is a very pessimistic view of workers but the interesting part is this is not about web
workers. This is actually from 1960 and it's talking about regular full-time employees.
McGregor was at MIT Sloan I believe when he wrote this book. But Theory X, the way
he saw it, was not necessarily the full answer. And I'll get back to this before the end of
the talk. But it's just kind of an interesting idea that we've already looked at workflows in
this same light; we've already looked at workers as having needed these tight controls
and incentive mechanisms.
So what's different now? What do we know now that we didn't know 30 or 40 years
ago? Are people smarter? Did they suddenly overtake computers in their ability? That's
not it. Computers can still do what they could much faster. In fact if anything, the gap has
closed between computers and humans. But what we do have is more technology
around us, more computer systems, more distributed systems that all can communicate
with each other and the user in a way that lets us integrate people and the way people
work in a different way. So if you think about a traditional workflow model or a traditional
hiring model: you hire one person and if you want to make your system more intelligent,
if you want to answer queries with human intelligence then maybe you have that person
sitting at a desk just waiting for you to ask a question. And they wait there all day and
maybe if you hold odd hours, you have to have multiple shifts of people just being paid
to sit and wait. Of course that doesn't really work but now with online labor markets such
as Mechanical Turk, you can hire a person to do a very short task and there is someone
always available on demand. In fact you can even get a lot of people on demand and
parallelize tasks that would have been impractical to parallelize before because you would
have needed to hire all of these employees full-time.
Now there are differences here, of course. If you hire a full-time employee, you have
some guarantee, at least some expectation that they're going to be around. But unlike
traditional employees, crowd workers might come and go as they please. You never
know exactly who's available, and often you don’t learn enough about any individual
workers to really have a guarantee on their expertise. But you can at least get these
workers fast. And you see work like Adrenaline and quikTurkit that have looked at
actually being able to bring in a new person, but you don't know anything necessarily
about that person.
So the traditional fix to this is to take your task and break it up into a lot of little different
pieces that people can complete quickly. And as long as they're completing it quickly
then you don't have to worry about this high chance that they'll leave before the end,
before the task is accomplished. But if I distribute this to a group of workers then I don't
want three answers to come back most of the time. Most of the time I want a single
answer for my question. So you have to start looking at ways to combine the multiple
workers which might provide you with more confidence but in a reasonable way.
Sometimes you might average, sometimes you might take a vote and select an answer, but
either way you're getting more confidence than you would have with a single worker by using
multiple people.
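As a minimal sketch of that combination step, not any one system's actual mechanism, here is what majority voting over labels and averaging over numeric estimates might look like; the function names are just illustrative.

```python
from collections import Counter
from statistics import mean

def merge_by_vote(answers):
    """Pick the most common categorical answer from redundant workers."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)   # answer plus the fraction who agreed

def merge_by_average(answers):
    """Average numeric estimates from several workers."""
    return mean(answers)

# Example: three workers answer the same question independently.
print(merge_by_vote(["cat", "cat", "dog"]))   # ('cat', 0.666...)
print(merge_by_average([4.0, 5.0, 4.5]))      # 4.5
```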
There are of course different ways to do this -- once the task is complete you can put it
back and select a new one. But there are multiple ways to do this. So for some tasks,
instead of looking at completely parallel workflows where everybody gets the same
piece of information, completes it, and then they all compare answers in the end
essentially, or the system compares answers for them, we can start to look at systems
where there's an iterative process. You get slightly better answers as you take a task
and pass it from one worker to the next, to the next and eventually you converge on an
answer and you can mark it as done and move on to the next piece.
Now this takes a little bit longer but you might be able to get more expertise in a lot of
cases when you have people building on each other's solutions. So in addition to just flat
models like that you start to get more and more complex models. Soylent used the
Find-Fix-Verify paradigm to have one group of workers select a piece of the task and pull it off, then
have a different set of workers actually identify what a solution to that problem is, and
finally pass it along and have a different group of workers verify it, because one of
the problems with an iterative model is that at any point if I'm passing it along from one
person to the next, one of the workers could simply delete the content or make a change
that makes it worse and makes it harder for other workers to complete their task.
But this kind of verifies that at least within each loop you have some improvement. So
now that we have the crowd and we can even get reasonable responses back from
them, what do we do with it? Right? Great, we can bring human intelligence to bear on
problems but what types of problems are most interesting? And for this we look at
artificial intelligence because in many ways they have the exact same goals, and they've
kind of dreamt up what they want but maybe don't have the final answer. So there's a lot
of overlap in interesting problems here. So just as one example, conversational
assistants have been a goal for a very long time.
So Eliza was from the late '60s, I believe. The problem with systems like this was that they
were very, very constrained. So you had a template of what kind of interaction the
system expected and if you didn't get that then kind of all bets were off. If I start asking
Eliza about booking a hotel then all that happens is we end up getting into a very shallow
conversation about who was messing up the conversation first. And even if you fast
forward to today, we have much better attempts at conversational interaction with
systems like Siri or if you're a little less concerned about portability, Watson. But in both
of these cases they're still looking at very specific formats for the conversation. They're
still looking at very specific use cases if nothing else. But what we want is conversational
assistants who can actually help us.
We want whole intelligent systems that can not only come up with what our solution is
but think about our intents, think about why we're asking this problem and maybe even
answer a question that we haven't asked yet. So things like conversational systems are
something that can make people's lives a lot more convenient, but there are roles where
this type of intelligent interaction can actually be much more powerful, can actually have
a transformative effect on the daily life of users.
Jeff Bigham and I really focused on assistive technology for this reason: being
able to provide a blind user with a way to interact with their desktop through
conversation alone; being able to help a motor-impaired user use a predictive
keyboard that actually has a better sense of what you're typing; activity recognition for
cognitively impaired and older users who might need a little assistance and prompting
throughout their day to safely live unaccompanied; captioning and
transcription services for classroom settings where students who are deaf need to be
able to follow along with the lecture but it's very expensive right now to provide this using
a single expert human; and even visual assistance where a blind user can
query their environment and find out more about their surroundings. So the takeaway
from this talk, I really hope, is that we can empower intelligent interfaces, this kind of
more broad goal. But most of the examples that I will show you apply these intelligent
interfaces to specific assistive technology problems.
Remember, this is the basic model we have. Take task, break it into pieces, send to
people. But how do you control an interface? How would I break this task up into a lot of
little pieces that individual workers can control? And this problem is actually what got me
into the crowdsourcing space because I looked at it and I said, "Well I mean I
understand human computation. You take a problem that your algorithm doesn't quite
solve, you put some people in it and then, you get yourself a mission accomplished
banner and you're done." But it doesn't seem to quite work for this interface. I could
break the task up by button let's say. I could have everybody controlling an individual
button, but that doesn’t really get at the heart of the problem. We don't have any
consistency between controls. Imagine this is controlling a small, commercial,
off-the-shelf robot.
But you can see one of the options here is to extend the camera. And that's great, but if
somebody chooses to do that to look over a barrier while someone else wants to drive
under it, you'll break off the poor camera. So we want consistency between actions. We
want like the whole of what the system is doing to be self-consistent. But backing out if
we had things like breaking it down by frame and I have workers look at the current
situation and they say, maybe, that I need to drive forward, then we ask a whole different
set of workers, "Okay now what do we do next?" Again, we have no consistency now
across time instead of across the actions that are selected. So the difference between
this and a normal task is that it is continuous. We have a person who is interacting with
the feedback from a system. And they're getting this feedback from a system and they're
getting this feedback as kind of constant input. And what they want to do -- and if this is
just a typical end user -- they'll take off a piece of that task -- now maybe in the interface
control it's knowing when to press a button or knowing how to react to some situation -- and they'll perform a task and provide us input back to the system.
Now another problem is that even that model is not as simple as it sounds, because if
you're not able to keep taking new tasks, they're not going to slow down. The
environment is not going to stop changing just because the way we selected to control the
system has to wait for a set of workers to come back. So if this worker holds it too long,
all of these tasks are going to back up and we're going to end up with incomplete
sessions or something is going to be missed in the environment. Now in general if we
want to crowdsource this, we can look at just parallelizing the input. So it's the same
type of task division, only now we're doing it with continuous streams of tasks. So the
task comes in, each worker takes off a piece; but then what? We can't give the system
three inputs at once. We can't even actually know that the workers all grabbed the same
piece of the task.
Maybe they saw different problems. It's really this combination step that gets tricky. And
at a high level what we want to do is just merge them back together, but what that looks
like is very domain dependent. So we started trying to fix this by wanting to
abstract away the crowd entirely. So what if we could look at the collective
intelligence itself as an individual entity? We call this a crowd agent. So internally you
have some specific -- there are two mediators: a task mediator and an input mediator. A
task mediator handles whatever breaking up a problem in your domain looks like,
passes that to workers and then, the input mediator takes the workers' output and
combines it all back together.
But of course both the task and input mediators are going to be very domain specific. So
we have to start thinking about what do these look like? How can you generalize this
model? But going back -- I'll say real quickly -- by presenting a single user, if you can
behave like a single user, if you can take input continuously and provide output as a
single stream, you can see the inputs and outputs are basically identical to what you
would have if it was really just an end user. So we don't have to change the
environment; we don't have to change the task that's incoming in any way. We can start
to communicate with existing computer systems or existing users in a very natural way,
and we just have to process this all internally.
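As a minimal sketch of that abstraction (the class and mediator names here are illustrative, not the actual Legion or Chorus code), a crowd agent can be modeled as a wrapper that hides the worker pool behind a single input/output interface:

```python
from typing import Callable, List, TypeVar

Task = TypeVar("Task")
Piece = TypeVar("Piece")
Answer = TypeVar("Answer")

class CrowdAgent:
    """Wraps a pool of workers so the outside world sees a single 'user'.

    task_mediator: splits an incoming task into worker-sized pieces (domain specific).
    input_mediator: merges the workers' raw outputs back into one answer (also domain specific).
    """

    def __init__(self,
                 task_mediator: Callable[[Task], List[Piece]],
                 input_mediator: Callable[[List[Answer]], Answer],
                 workers: List[Callable[[Piece], Answer]]):
        self.task_mediator = task_mediator
        self.input_mediator = input_mediator
        self.workers = workers

    def handle(self, task: Task) -> Answer:
        pieces = self.task_mediator(task)
        # Hand pieces out round-robin for simplicity; a real system
        # recruits and assigns workers asynchronously as they arrive.
        raw = [self.workers[i % len(self.workers)](piece)
               for i, piece in enumerate(pieces)]
        return self.input_mediator(raw)

# Toy usage: the 'workers' here are just functions standing in for people.
agent = CrowdAgent(task_mediator=lambda text: text.split(". "),
                   input_mediator=lambda parts: " / ".join(parts),
                   workers=[str.upper, str.lower])
print(agent.handle("First sentence. Second sentence. Third sentence"))
```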
So looking back at the example of the robot: this is kind of one of the first pieces of work
we did. And I'll have the citations along the bottom there so if anybody wants to kind of
follow up. I'll be happy to answer any questions but that might be a nice resource. So
what we're looking at is even more general than just controlling a robot. How can you
control the interface that controls the robot? How can you actually take any UI that's out
there and control it with the crowd?
What we came up with was Legion, and Legion allows you to select a piece of your
interface right here, just drag a box around it and then, you'll set some parameters like
price and a small description for the workers and you'll set what keys they're allowed to
use and what mouse input they're allowed to use. The system automatically starts to
recruit workers from Mechanical Turk. And when they arrive at the page they see this
little bit of information you've given them, so they see a quick description, some of the
keys that they are able to use and then, they're given what is essentially a remote
desktop connection to the system.
However, of course, we mediate their input so that one worker can't just start controlling
someone else's computer ad hoc. But the way we figure out how to combine this and
how we compare people to find out if they're doing a good job is actually to use the
group not as a set of input votes. We're not looking at how many people just pressed up;
"Well, it is a majority so I should go up." Instead, we use it to select a leader. And then,
we give the leader short periods of control. What this does is essentially allow people
to overcome small decision points where you get to a point where you can make two
equally good choices. You can't get agreement in the crowd because there's no
communication in this case between the workers; we just want to see what people would
do.
But we allow whoever is in control to kind of get around that point and then see if people
start to agree with them again. So as we go each worker is providing some input. And
you can see that we switch from yellow to blue here. We always ignore the red malicious
worker at the bottom because there's just not enough agreement. The way we calculate
this though is actually to take the input from a single worker and compare it to the
aggregate. So we look at every input that was given over that time span which is usually
about half a second to two seconds, look at all the input and we say, "Which part of that
was theirs?" And we can compare this by looking at the vector cosine which is
essentially the projection of the vector of input from the single user onto the crowd's
answer. And we assume that the crowd is correct. Yeah?
>>: So you have to get this whole bunch before you actually decide who's [inaudible]?
>> Walter Lasecki: Yes. But what we do is it's essentially the vote for the next period. So we
go through and everybody provided input with one person in control. Right? And as soon
as you hit the next piece -- We use the voting data.
>>: Right. So if I say I want the robot to go right, up, up, left, I have to do all that in my
head without actually seeing...
>> Walter Lasecki: No, no, no. So we're letting one person control the specific instance
but we're looking at how much they agreed with other people who were concurrently
controlling to select who the next leader is going to be for the next one-second span. So
you're being rated based on your historic performance and then we pick the best
representative from the crowd going forward. But that's a good point, thanks.
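A rough sketch of that scoring step as I read it (simplified, not the actual Legion implementation): treat each worker's key presses over the last window as a count vector, compare it to the crowd aggregate with a vector cosine, and let the best-matching worker lead the next window.

```python
import math
from collections import Counter

KEYS = ["up", "down", "left", "right"]

def to_vector(key_presses):
    """Count how often each key was pressed during the window."""
    counts = Counter(key_presses)
    return [counts.get(k, 0) for k in KEYS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pick_leader(window_inputs):
    """window_inputs: dict mapping worker id -> key presses in the last window.
    Returns the worker whose input best matches the crowd aggregate,
    which the system assumes is 'correct'."""
    aggregate = to_vector([k for presses in window_inputs.values()
                           for k in presses])
    scores = {w: cosine(to_vector(presses), aggregate)
              for w, presses in window_inputs.items()}
    return max(scores, key=scores.get)

# Example window: two workers drive forward, one presses down (the outlier).
print(pick_leader({"w1": ["up", "up"], "w2": ["up", "left"], "w3": ["down"]}))
# -> 'w1' (best agreement with the aggregate)
```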
And using this we can do more than just kind of de-noise the crowd. We also looked at
this kind of fun application of letting people play single-player video games as multi-player. And it was kind of interesting -- I think the reason I'm
interested in this work is really that we can start to use multiple Legion systems concurrently.
So we actually are averaging across different subsets of the controller here. So we don't
want everyone to have to agree on all of the actions. If you want to control a piece of the
system, maybe you just want to work on like special abilities and jumping or whatever in
a video game, you can work on just that piece. You don't have to participate in the larger
task. You can specialize or you can collaborate with others. And if multiple people are
trying to control the same subsystem then you can get the averaging. Yeah?
>>: So is it confusing for the workers to give input and kind of see that their input is not
followed?
>> Walter Lasecki: Yeah, so...
>>: Because if the person says, "I want us to go left" then I see it going right and I kind
of need to make a decision again.
>> Walter Lasecki: Right.
>>: [Inaudible]...
>> Walter Lasecki: Yeah, so you kind of have to adapt to...
>>: ...[inaudible].
>> Walter Lasecki: Yeah.
>>: Then how does that...
>> Walter Lasecki: Mediated?
>>: ...interaction go?
>> Walter Lasecki: So it turns out that in a lot of these cases if you have reasonably low
decision point -- Well, so basically a decision point is any point where you could have
equally good options. If you have a reasonably small set of those throughout the course
of the plan then you'll diverge very little from the crowd if you're trying to do the right
thing. So if we're driving to the teapot here, they can avoid the obstacle in more or less
one way. Now there is a little bit of leeway in there, but if you start to disagree too much,
you have a radically different plan from the rest of the group or you might just not be
giving a valid answer -- So the short answer is you actually don't always notice that it's
not listening to you except for in select cases. And when you do, you can see that
basically it acknowledges your input. It's not that the system didn't hear it. You see that,
yes, you pressed forward. But it does tell you, though, the workers are informed that
they're working as part of a group. So right now it's listening to the crowd. "We heard
your up but we're going to listen to the crowd for this moment." And because you can't
take too large a step away from what you were just doing, the replanning process for
people is very simple.
So if it takes a slight right there is not a huge cognitive difference there.
>>: But the advantage of being [inaudible] some simple [inaudible], you want to have
some kind of continuous decision making coming from the same person or is it
[inaudible]...
>> Walter Lasecki: So part of it is that we want a continuous decision process
throughout. And what you get is because we're ranking similarity, the people at the top
all tend to agree with each other. So even if you switch who is in control for that small
moment, the general plan is usually the same. But the real big advantage is, yeah, you
get to make these decisions online. If you stop and vote first, it greatly slows down the
system. If you have to actually react to things in the environment, it slows it down. And
what we were able to show, actually -- And I don't have the chart on here -- is that
compared to other simpler input mediation strategies this actually is not only faster but
more reliable. So the average time if you just gave it to a single person tends to be a
little bit -- faster, sorry -- it's like 10 to 20 percent faster than if
we did it with the crowd using this mechanism, but they only complete 40 percent of the
time. And in our trials I think we completed 100 percent of the time. So you get kind of
the combination of speed and reliability.
Other approaches that were similarly reliable often took two to three times as long to even
complete a simple task. Did you have a --? All right. So it can be applied to other
interesting domains. But one thing that Legion kind of relied on is the fact that all of the
workers could concurrently complete the same task. And we do so in a somewhat
consistent way. But what if your workers are not actually able to do any of it, or are
actually unable to do the task itself? And we see this problem in captioning.
Professional captionists are really highly trained -- years of training in classes -- to be able
to type 300 words per minute, which is actually physically impossible on a regular
QWERTY keyboard. So they also have to train to use these chording keyboards, the
stenographer machines. I don't know if you've actually ever seen one of those.
So we're not going to find any of these people on Mechanical Turk. We certainly are not
going to find five of them. And if you consider that the average stenographer charges
something like 100 to 300 dollars per hour for their services, we don't really want to find
5 to run concurrently. But using the same model where we have the task broken up and
then sent to many different workers, we can actually just have workers type what they
can. And we can divide up the task or just kind of have people self-select from the task.
It turns out the dividing is a little bit more consistent because you get natural saliency
adjustments throughout the stream. So I, for example, might always caption the first
word and everybody in this room would too but we might miss the tenth word after that.
So we can divide up the task. We can ask people to type however fast they can, 40 to 60
words per minute is plenty. And then, by using multiple workers and an alignment
process that stitches everything back together, we can actually still cover the complete
stream. So we can get this full answer and what we do is use multiple sequence
alignment. There is a separate paper here that looked at how we can use an A-star
search to actually improve this to the point where we can run it in under a second and
get some error bounds on the correctness of the alignment itself. And this works very
well. And if you think that what we are looking at here is maybe three to five workers
able to cover three hundred words per minute then -- Excuse me -- if I'm paying students
that can type that fast or Mechanical Turk workers a reasonable wage, maybe 10 dollars
an hour, then we're at 50 dollars or less per hour to get the same service that we
would've been paying 150 for before.
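The real system uses multiple sequence alignment (with the A* formulation just mentioned); as a much-simplified illustration of the merging idea only, a greedy pairwise merge of partial captions using Python's standard-library difflib might look like this.

```python
from difflib import SequenceMatcher
from functools import reduce

def merge_pair(a, b):
    """Merge two partial transcripts (word lists) into one sequence that keeps
    every word either typist captured, preserving order where they agree.
    Ambiguous overlaps just keep both versions; a real aligner would vote."""
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if op in ("equal", "delete"):
            out.extend(a[i1:i2])
        elif op == "insert":
            out.extend(b[j1:j2])
        else:  # "replace"
            out.extend(a[i1:i2])
            out.extend(b[j1:j2])
    return out

def merge_captions(partials):
    word_lists = [p.lower().split() for p in partials]
    return " ".join(reduce(merge_pair, word_lists))

# Three workers each caught a different slice of the same sentence.
print(merge_captions([
    "the quick brown fox",
    "quick brown fox jumps over",
    "fox jumps over the lazy dog",
]))
# -> "the quick brown fox jumps over the lazy dog"
```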
And this also makes the problem approachable by more workers. So I couldn't
contribute to professional captioning even if I really wanted to, even if I would charge
less. No matter what I did, I couldn't do it. But by lowering the barrier to entry using this
system what we're able to do is say that we have a much bigger worker pool and we can
make this service available much more on demand. What you end up finding if you go
into a university, for example, and you ask them, "What would it take, as an accommodation,
for a deaf or hard-of-hearing student to just get captions in the classroom?" is that you have
to give at least 24 to 48 hours' notice that you need this, and you have to
be on file in advance with the disability services office just to be able to get the support.
And it's because they can't schedule anyone any faster. They're not always available and
they might be overbooked.
So we can actually make this service not only cheaper but more available by using non-expert workers in this way. So kind of along the same lines, looking at tasks that
individuals cannot complete themselves: we have activity recognition. The problem here
is not that people can't identify the activities just like they could hear the words in Scribe;
the problem is that they can't input the information to the system fast enough. So if I
want labels for everything that's going on in this scene right now, most of us would be
pretty hard pressed to be able to type that as it happens and as they switch from one
action to the next, especially if you went into action level not just activity level. So what
Legion AR did was it used the same type of process to run a lot of these -- You take the
same stream. We color code it so that workers can focus on an individual. And then, you
type as many of the actions as possible. Now this is a lot more general. We don't divide
up the task in any specific way according to time. We just let you watch the stream, and
you type as many tags as possible.
What workers will see kind of on the side of the video will be labels, and they see what
other people have typed previously, what the system's guess is because we're actually
starting to get to the point where we're integrating existing machine learning techniques
here in the form of a hidden Markov model-based activity recognition system. And that's
going to not only allow us to try to take a guess with the system and then have it
confirmed by people, but it's actually going to let us learn over time. So the crowd is no
longer providing its own stand-alone service; it's providing a training service. We don't
want to be monitoring people forever using just the crowd. What we'd like is when
something new happens we can bring the crowd in on demand. They can cope perfectly
fine. They know what payment looks like as this person reaches for their wallet, or what
leaving the room looks like, or whatever; they can identify this, in all the different forms it comes in,
when the system can't. So they give an answer. The system learns, "Okay, now I've
seen an instance of that."
And we're starting to work on systems that actually allow the crowd to give us a little bit
more understanding about the same problem where we don't just want to know what the
label is but we want to know what it means to have performed that action in a given
order. So, was it necessary to do one thing before another? And we start to get
dependency relations between all this. And again we can use that to train an existing
system, in that case, a lot faster.
So the system -- Sorry, go ahead.
>>: [Inaudible]. Does the crowd tend to converge on a granularity of actions?
>> Walter Lasecki: Yes. Although, what we were doing is we were trying to focus them on a
granularity so we actually provide like two or three examples that might not be exactly in
the space. So for monitoring this room we might give kitchen labeling examples just
because that's what we thought of off the top of our head. But people are able to kind of
take the line "making breakfast" instead of "cooking eggs in a frying pan" or "lifting pan"
or something like that, and they can map that on to the new domain. So we prompt them
a little bit to tell them what we want but, yeah, they are able to converge. And they're
able to converge to consistent labels because once we have kind of a guess put into the
system we actually let the system propose as a worker. And I'll get to that a little bit later.
But they propose as a worker, so between sessions you can make sure that you say
"making breakfast" instead of "cooking eggs" because that's kind of the label that was
agreed upon. And people tend to not change the answer unless there's something
wrong with it. If it seems like a viable answer, they'll stick with it.
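As a sketch of how those converged, time-stamped labels could feed a sequence model (a simplified count-based estimate, not the actual Legion:AR training pipeline; the segment times are made up), you can expand crowd-agreed segments into per-frame labels and estimate activity transition probabilities from counts:

```python
from collections import Counter, defaultdict

def segments_to_frames(segments, fps=1):
    """segments: list of (start_sec, end_sec, label) agreed on by the crowd.
    Expands them into one label per frame so a sequence model can be trained."""
    frames = []
    for start, end, label in segments:
        frames.extend([label] * int((end - start) * fps))
    return frames

def estimate_transitions(frame_labels):
    """Count-based maximum-likelihood estimate of P(next activity | current)."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(frame_labels, frame_labels[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

# Crowd-agreed labels for one hypothetical session (times in seconds).
session = [(0, 10, "making breakfast"), (10, 12, "eating"), (12, 20, "eating")]
frames = segments_to_frames(session)
print(estimate_transitions(frames)["making breakfast"])
# mostly self-transitions, with a small probability of switching to "eating"
```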
So in all these systems that we've looked at so far we are able to give natural language
commands to the system whether that be in the case of these examples for activity
recognition or directives on what to control or what the goal is for Legion in the control
task. We can actually capture people's speech using something like Legion Scribe. But
what we don't have at this point is the ability to get feedback. So we want the crowd to
be able to clarify a task, clarify what they're supposed to be labeling. In Legion if they hit
a blockade and they can't go past while driving a robot, let's say, maybe we want to let
them request or ask the end user, "What do you want me to do here? Should I go find a
different goal? Should I keep searching? Should I give up and save battery?" Because
especially dealing with users who may under-specify the problem, the whole point is that
you don't have this formalism up front. We want to actually let people just naturally
request things.
So the question is can you actually get natural language feedback with a crowd? It
makes sense that you could but you, again, don't want six or seven or a hundred parallel
conversations. You want one conversation. You want one clarification process and often
in real time. So to look at this we made Chorus. Now Chorus is a system that actually
allows us to do exactly this: it allows us to have conversations with the crowd. And it
does so by having multiple workers in the background all proposing and voting on each
other's answers so that the crowd is actually self-filtering in this case. And what you see is
the conversation that the end user sees. This is actually a hybrid view. So everything in
white here -- I don't know if you can see that -- is something the crowd said, but
everything in this conversation appears to the end user, appears in the final conversation.
To make sure that workers are consistent and are actually able to be on-boarded, since we
don't know when people are going to connect and when they're going to leave, we also provide them with
this working memory that lets them kind of pass notes to each other about what
important things have happened in this conversation.
So this is how we can maintain consistency across multiple sessions. But how we
maintain consistency with multiple workers who might want to have different
conversations, who might have different ideas of the solution, is that we actually have a
backend that lets people vote on specific answers. And there is a point and monetary
reward system that just uses a very simple incentive mechanism to make sure that
people actually get bonused for contributing useful answers.
And this bonusing actually allows us to do two things: one, it allows us to select just the
consistent answers. And I know it's kind of hard to see that these are pink. But just the
ones in blue boxes were actually accepted. And so we have the ability to vote on these
things, but the ability to actually reward people dynamically for their contributions allows
us to pay workers appropriately within this continuous model. So you might do a
completely different amount of work. You might stay for a completely different amount of
time than another worker, but instead of kicking you out and making you take a new
small task that asks you to post a response or something, we let you stay attached and we
pay you based on how much you do during that time.
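A minimal sketch of that propose-vote-bonus loop (the thresholds and point values here are invented for illustration, not Chorus's actual settings):

```python
from dataclasses import dataclass, field

ACCEPT_VOTES = 3        # illustrative threshold
POINTS_PROPOSE = 1
POINTS_ACCEPTED = 5     # larger bonus when your message reaches the end user

@dataclass
class Proposal:
    worker: str
    text: str
    votes: int = 0

@dataclass
class ChorusRound:
    proposals: list = field(default_factory=list)
    points: dict = field(default_factory=dict)

    def propose(self, worker, text):
        self.proposals.append(Proposal(worker, text))
        self.points[worker] = self.points.get(worker, 0) + POINTS_PROPOSE

    def vote(self, index):
        self.proposals[index].votes += 1

    def accepted(self):
        """Messages that pass the crowd's own filter get shown to the end
        user, and the proposer is bonused for the useful contribution."""
        out = []
        for p in self.proposals:
            if p.votes >= ACCEPT_VOTES:
                self.points[p.worker] += POINTS_ACCEPTED
                out.append(p.text)
        return out

round_ = ChorusRound()
round_.propose("w1", "A city pass covers all of those attractions.")
round_.propose("w2", "I think it's closed on Mondays??")
for _ in range(3):
    round_.vote(0)
print(round_.accepted())   # only the well-supported answer goes through
print(round_.points)       # {'w1': 6, 'w2': 1}
```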
Now what we tested this on -- And this is going to appear at UIST this year -- is actually
conversations with a dozen users who are asked to come in and kind of follow the script.
But the underlying task was a search task. They were trying to find a piece of
information. They were trying to plan a trip. They were trying to just maybe even have a
general purpose conversation to help them make a decision. So one example is, "Should
you get a dog in an apartment?" This is not something that you can just kind of go banter
with an existing system, like you can't talk to Siri about if you should get a dog. It will
simply search that question.
And it can adapt and you actually get a lot of very interesting serendipitous information
finding where, because we have multiple workers, they'll come across more information than they
would have as individuals. And a lot of times that information is actually more useful
than the answer to the question that was originally asked. So as an example in testing
we saw one time the person was asking, "What are different places to visit in Houston?"
And after they got a list they said, "Great. How much does each one cost?" So they
wanted a set of prices to come back. What one worker ended up finding and the group
promoted and passed through was actually a city pass apparently that gave the user all
of the same activities that they were looking for plus a few extras and was cheaper than
paying the individual prices.
So the system didn't just do what we asked but the workers were actually able to figure
out that what they came across was a better answer and proposed a response to a
slightly different question. But it's not just a conversation partner. It can actually act as an
interface. And we started to look at applying this to other domains where we want
continuous interaction with VizWiz. So VizWiz is a system by Jeff Bigham, et al -- like ten
other people -- that allows blind users, from their cell phone, to take a picture, ask a
question about that picture and then get a spoken response from the crowd. It's actually
VoiceOver that speaks the response. The crowd just provides text. This is actually an
extremely popular system right now.
We have had this out for, I think, two years. And it's answered over sixty thousand
questions and had five or six thousand users, I think, the last time I checked. And you
can see the types of uses that people have for this. So maybe it's trying to figure out what
denomination of bill they have. Maybe it's reading what the stove is set to or reading the
nutritional information off the back of a can. And while this system has been extremely
successful, there are situations where an image alone is just not enough information. It,
as you can imagine, is very hard to frame images for a blind user. So they might take a
picture of part of a package, ask a simple question: "What are the cooking instructions?"
in this case. And the crowd gives a response like, "Well, the instructions are on the other
side."
Okay, fine. They can't tell so they flip the box over. There are no real tactile clues as to
where this answer is. So they flip it over and give an image. Now a completely different
crowd of workers, a completely different set of workers gets this question. They don't
know that the user was just on the other side. So they look and they say, "Well, there are
no cooking instructions in here. Flip it back over." And this can go on. And I think this was
almost a 15-minute interaction. And yet, it's such a simple question. If you had another
person next to you that you could interact with, you could very easily get this answer. But
because the tasks are broken up, because the crowd does not have any continuity
between it, between the different groups answering, it becomes very difficult to answer
even simple questions.
What we did was we built Chorus: View which essentially took the conversational
interaction of Chorus and linked it to streaming video. Much like having someone on
Skype, you can actually have a conversation about some item. So you look at this here.
What the worker sees is a continuous video stream. They can take some screen shots if
they need to kind of look at something a little bit closer. But then they see this
conversation on the side and it's just chat messages only the user is going to interact
with the crowd via voice. So they get played this little message, hear the question and
then provide answers. And you can see that this one was voted in but this one was not
because one person already found the answer while the other just couldn't read it.
Again kind of going back to the serendipitous discovery, it turns out that a group of
workers even with low-quality video streamed over the web can do a very good job of
retrieving this information because it turns out that somebody takes the screen shot at
the right time or somebody sees a piece of information or can interpret a piece of
information or recognizes the package in some cases. So what we were getting are
responses that we, as the designers, couldn't have answered from video but some crowd
worker was able to.
And then, you can see that this type of interaction allows the user to take a picture -- well, not take a picture but start the stream -- aim somewhere on the package and then, if
it's wrong they get an adjustment, "Okay, well rotate to the left." Or maybe it's, "Rotate until
we tell you to stop." Maybe it's, "Rotate 90 degrees." Once they do that, we're using the
same crowd and the same group of individuals who know what just happened, know
what pieces of the package they checked and can kind of get more perspective because
over time you can understand what you're looking at even if you can't make out the
specific piece of information and you get responses a lot faster. And this is going to be at
ASSETS this year.
So what's next? Well we want to finish kind of formalizing this model. You can think of
this in terms of being a dynamic system or modeling this using control theory. But we
want to know what can you get out of this crowd agent model if you formalize it? Maybe
people are too uncertain but you can still give some error bounds if you assume
something about the workers themselves. And hopefully this isn't a very flat assumption
like a lot of incentive models are, where people will strictly optimize for some factor.
And then, we want to start really deploying these systems in the wild. We have been
able to use real users to test these things but we haven't just seen what happens if you
give people an intelligent interface. And I don't think anybody really knows yet. Right?
We've always had these very limited-scope, very limited-release systems. Siri at least
had a large release but only handled something like 15 functions, and they're reasonably
straightforward.
And then, we need to address some of the security and privacy concerns that this brings
up because if you look at all of these systems, you know, we're streaming video to the
crowd. We're giving personal information maybe about location so we can find a
restaurant via chat. Again, this is video. We're streaming audio to the crowd. We have all
this visual data even in the form of images which might contain information that you
didn't want to share, either personally-identifying information or something like catching a
credit card in the background of a picture or catching something about maybe a mailing
address because there's a letter left on the desk here. So it's very easy to capture
information that we maybe didn't want to make public. So that's what I've been working
on this summer. And I've been working with Jaime and Ece on this.
And we started by really just wanting to know what are the threats. If you think about
attacking a system you would need one malicious worker to be able to find out
information from a task. So if somebody accidentally captures their credit card, it only
takes one person to corrupt that. But if you wanted to actually give the wrong answer, if
somebody takes a picture of a can of peas and you want to make sure that that person
really thinks it's a can of corn for whatever reason -- which might actually matter for
Azure reasons and all this other stuff -- what you really need is a majority. So there are
two different manipulation types but where can we find workers for this? And I should
point out that this is to find a piece of information but finding the task that contains that
piece of information is much more difficult.
You probably have kind of a needle in the haystack problem. So you want one worker
per task maybe but you need a lot of workers to find those tasks. So in either case you
have a bigger benefit from having a group. And one of the easiest places to recruit a
group of users is the crowd itself. So could we actually attack crowd-powered systems
with the crowd?
So the first thing we look at is how can we extract information? And we want to go and
we want to get that credit card. I've stopped using my own and I want to use somebody
else's. And we've set up this task where when the worker gets to the task they are asked
to click this big button. That immediately forwards them to a task posted by a different
requester, and that's just a very simple like, "Tell me what's in this image." Maybe it
doesn't even know. Maybe the requester of this inner-task doesn't even know. This
target here is being attacked. And we want to know, can we get people to return us the
same information to the attacking task? So the first example we looked at was this
hopefully very easily identified fake credit card. It's not real. It's drawn. The number is
12345; nobody is going to really confuse this with someone's actual credit card. And we
want to know, are workers more or less willing to answer this question compared to
something that looks like a real card. Now this actually is not a real card but they didn't
know that. And it looks pretty real. This person is holding it up. It's even got the hologram
and everything. So what do you think? Like, can we get workers to do this malicious
task? Or are they somehow perfectly ethical?
So to get the ethical task, it's kind of complicated. We have a lot of information here that
we copied out. Sixty-two percent of workers will successfully give us all the information
we wanted. But as soon as we make this a potentially harmful looking task, it drops to
32. So that's kind of interesting. So workers are paying enough attention to this task and
are making some assumptions obviously about maybe we aren't linked. Maybe we don't
own both tasks, right, because we certainly didn't post as the same requester. They have
no way to know that we're the same person. And obviously they think that there's harm
or at least this appears to be the case. Now what if instead of taking information out, we
want to put information in? So what if I just wanted to give the person the wrong credit
card number or maybe translate the word incorrectly?
I believe reCAPTCHA had a problem with this, with people just typing in other words. So
now the attack task is even simpler. It says, "Don't come back and tell me what they
said. Just tell them that this word --" Can anybody tell what this word says? Anybody
want to take a guess? Okay, so it's a little ambiguous. It could easily be gun or fun or
sun or lung or fir. Somebody -- You know? Sun. So, "Tell the person requesting the task
that this is sun." And I kind of, you know, can see that. Sure. On the other hand our
second condition was, "Look at this word." And I'm sure more people can make it out.
"Also tell them that's sun." Right? It's basically the same. So 75 percent of people were
willing to say that this was the word sun. It seems kind of viable until you consider that if
it wasn't suggested only 12 percent of people actually said that this was the word sun.
Most people actually thought it was gun.
Okay, so they're suggestible. Maybe we [inaudible] a little bit. Maybe we convince them
that this was the right thing to do. Maybe they were just willing to listen to us because it
didn't seem harmful; it seems like that's a plausible answer. Not as many conditions over
here. They can't really think that this is a viable answer. And sure enough we see it
dropped to just under 28 percent of people who were willing to give us the answer we
asked for when it's clearly incorrect. So this is on some levels promising. Right? It means
that there are workers out there who will see this as being a kind of ethical decision to
make between these two different tasks and will respond at different rates.
Or maybe they don't want to risk it; they want to do what they're told by the inner-task.
Let me stay there for a second. So that was interesting but we still have about 30
percent of the crowd in each case that's willing to do whatever we just asked them to do. So
what can we do about that 30 percent? What can we do to safeguard against that part of
the population? And the typical way to post a task is you start -- you have some
potentially sensitive information. You send it to random people on the Internet and you
have the potential for bad things to happen. What we would like to do instead then is pass
this sensitive information -- let me try to handle this -- to a filter. And then, using that
filter to take out all the pieces that we don't want the crowd to see, come up with a more
anonymized task which is then passed to people on the Internet and we get our task
solved without having to sacrifice privacy.
Okay, so as an example of that let's take this real simple little piece of text. Let's pretend
this is hand-written out or whatever and, "Hi, Bob. Here's my username and password.
Alice." Now anyone who has been anywhere near a cryptographer, cryptology class,
anything knows that Alice talking to Bob is a recipe for disaster. It never ends well. And
there's always Eve who is where somewhere off to the side listening in waiting to steal
that information. The problem here is that we can't just encrypt this because Eve is
mixed in with the rest of the crowd and we don't know who that eavesdropper is and who
is just trying to help us complete our task. So what we'd like to do is anonymize this
information. We take out all the sensitive pieces and now Eve has nothing to do. The
problem is this also potentially limits the crowd's ability to solve the task. But if I get a
response that says, "Well, I've parsed your piece of information. You have two facts in
here." And it says username and password; both are stars
Well, that doesn't really help me as now the end user. My information was processed
correctly. Kind of my anonymized information was parsed and now I got something back
that wasn't as useful to me because I don't know which pieces were used here. So we
go a little bit beyond just filtering, and what we want to do is actually replace the
sensitive content with semantic tags. And this will hopefully allow us to track each piece
of information with the system on the backend. And then, when workers actually refer to
a piece of information it's like, "Okay, well, female name, username is such and such."
Now the system can repopulate it before the end user sees it again and they can kind of
see with some degree of security exactly what they expected to when they put
something into the system. It comes back as, "Well Alice's username is AStern." But the
crowd doesn't get to see that. And we want to do that by actually dividing this up into
small pieces, maybe only giving each crowd worker one word at a time, and having them
slowly filter and complete the task -- filter and replace these semantic tags as they go.
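As a toy sketch of that filter-and-repopulate idea (the regular-expression patterns and the password value are made up; a real filter would be built incrementally by trusted workers or a model, as described here), sensitive spans get swapped for semantic tags before the crowd sees the text and swapped back before the end user sees the response:

```python
import re

# Hypothetical patterns standing in for whatever the filtering step detects.
PATTERNS = {
    "FEMALE_NAME": r"\bAlice\b",
    "USERNAME":    r"\bAStern\b",
    "PASSWORD":    r"\bhunter2\b",
}

def anonymize(text):
    """Replace sensitive spans with semantic tags and remember the mapping."""
    mapping = {}
    for tag, pattern in PATTERNS.items():
        for i, match in enumerate(re.findall(pattern, text)):
            placeholder = f"<{tag}_{i}>"
            mapping[placeholder] = match
            text = text.replace(match, placeholder, 1)
    return text, mapping

def repopulate(crowd_response, mapping):
    """Put the real values back before the end user sees the answer."""
    for placeholder, value in mapping.items():
        crowd_response = crowd_response.replace(placeholder, value)
    return crowd_response

message = "Hi Bob, here's my username AStern and password hunter2. - Alice"
safe, mapping = anonymize(message)
print(safe)   # what the crowd sees: semantic tags instead of the real values
answer = "The username for <FEMALE_NAME_0> is <USERNAME_0>."
print(repopulate(answer, mapping))   # "The username for Alice is AStern."
```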
So if I see a name, I can label that as such. And maybe I don't know this is a
username until we back out a little bit and get increasingly larger context for a piece of
information, and we filter it as soon as we can. All right. But even if we filter out all the
information -- this is kind of going a little bit beyond the internship here and beyond what
we've previously discussed -- even if we can filter out this information and we have like
an anonymous crowd who can give us responses that are good responses, that don't
reveal any personal information, we still have a scaling problem with crowdsourcing.
Kind of in the extreme, imagine if everyone on Earth had a crowd-powered system
running from their phone. Where do you find the people? Where do you find every
person's system? Even with timesharing, it just doesn't work out, right? Now that's pretty
far in the future, but even for cost reasons we probably want to minimize the number of
people we hire for these tasks. So can we actually increase automation? Can we use
this same model that has allowed us to stop thinking about the crowd as a collection of
noisy workers and just look at it as a single entity that gives us a reliable answer to
actually provide training data or maybe even go beyond providing training data and
provide understanding and structure to the final system?
Now within the crowd agent model itself what we can do is when we look at this general
process, right, we have a bunch of people -- a bunch of entities -- completing tasks and
they're doing so in a continuous system. But the model itself doesn't actually know
whether this person is a human or a machine. So if we can start replacing some of the
contributors, some of these noisy contributors that we're already assuming, with noisy automated
processes, even if they get the answer wrong our expectation is that the system as a
whole, by using agreement with human workers who can kind of catch these errors and
maybe provide good input themselves, doesn't have to propagate that error through the
merging process. And what this means for an interactive system is that we're able to
use the system with maybe flawed or only partially capable AI in the
background, so we can complete a piece of the task but not the entire thing, and all the
noise generated by that, all the wrong answers, are removed from the system. They
never corrupt the end user's experience. So even if we have to pay a little bit more every
once in a while for more people, we can do that hopefully as little as possible and we
can do that in a system that never looks like it makes a mistake or at least has a
reasonable error rate let's say, whatever we expect from humans.
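A minimal sketch of that hybrid merging step (the classifier stand-in and the peas/corn example are illustrative): the automated process votes like any other worker, so when it is wrong it simply gets outvoted before anything reaches the end user.

```python
from collections import Counter

def noisy_classifier(image):
    """Stand-in for a partially capable automated model; here it happens
    to mislabel the example image, the way a weak classifier might."""
    return "can of corn"

def crowd_agent_answer(image, human_answers, use_model=True):
    """Merge human and machine votes. The machine is just one more noisy
    contributor, so its mistakes get outvoted instead of reaching the user."""
    votes = list(human_answers)
    if use_model:
        votes.append(noisy_classifier(image))
    answer, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    # If agreement were weak, a real system could recruit more humans here.
    return answer, agreement

print(crowd_agent_answer("photo_of_peas.jpg",
                         ["can of peas", "can of peas", "can of peas"]))
# -> ('can of peas', 0.75): the model's wrong vote never reaches the end user
```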
So just to briefly recap here: I started by talking about the background of crowdsourcing
and how interactive systems can go beyond this kind of divide and batch process model
that has been mostly pervasive in crowdsourcing so far. And then,
we talked about how we can really use this crowd agents model to go beyond just simple
interactions of the crowd, become more interactive with our systems, actually power
intelligent systems that can speak back, can have conversations and can be consistent
across long ranges of time, and how that starts to point to kind of truly intelligent systems
that we can put into the wild.
And right before I finish up here, I promised I'd get back to this quote. So this is kind of
interesting because I think this points to kind of the interesting future of crowdsourcing or
something that we as a community should look at going forward. This is actually wrong.
Empirically, from years and years of practice, managers don't think this, at least not any
more. If you oppress your employees, if you take each one and say, "I'm going to
micromanage you so that you never waste a second," and you put these systems of
control with a narrow span of control on workers, they don't actually do more; they're not
actually more productive. And McGregor didn't actually believe that this was the answer.
He believed this was part of a spectrum, and kind of the other side of this was what he
called Theory Y. And that says workers are self-motivated. Workers want to be doing a
good job; they have some pride in their work. And what managers are supposed to do is
not put them into these rigid structures of control but actually just enable them. And I
think for people working here that's certainly closer to what their day to day experience is
than being told every minute of the day what to do and maybe having to file TPS reports.
But I think just now crowdsourcing is starting to get away from this thought and come
into this belief that workers can be leveraged in different ways, can be leveraged maybe
as more self-motivated employees if you just compensate them fairly. It will be
interesting to see what the crowd community can learn not just from management
science but also from cognitive science, psychology, operations research and all these
other fields that I think have set a lot of the precedent in the same type of work but in a
different domain where now crowdsourcing can come and extend this to more complex
tasks, more dynamic tasks and really change the way that we do business.
Thanks.
[Applause]
I'm happy to answer questions. Yeah?
>>: [Inaudible] what if you had where the crowd collectively was initiating an attack?
Does the system have an ability to defend itself?
>> Walter Lasecki: So for example if you had [inaudible] or Anonymous or somebody
kind of come out. Well, so if you're looking at the method for creating a private task like
the filtering that we were looking at, that's kind of going through a self-selected crowd.
So we're putting this on Mechanical Turk. If Anonymous wants to then try to corrupt
Mechanical Turk, it passes that back [inaudible]....
>>: [Inaudible] simply by doing volume. So your crowd source is only so large
naturally. So like you say, [inaudible] or say a group at Reddit decided they were going
to try and attack this system, they could in theory overrun it by simply being the largest
group of users, by being 60 percent of the users. They could overrun any particular path.
Does it have a way to identify feedback from users to say, "This is wrong." And then, you
get enough of these things and you say, "Our source has been compromised right now,"
and then it flags the system or something?
>> Walter Lasecki: So it's possible that because we saw this difference where some
workers were willing to kind of back out of a task and they were actually passing up
payment to not complete the task that we saw -- so in either the extraction or the
manipulation example -- it might be possible to ask workers to essentially flag information
like that. So maybe we get a very small percentage of the group coming back and
saying, "I see what other people are doing. I see that this is a problem." But then you
have to make it transparent to other workers what people are doing. In general if you
have to take over 60 percent or even just 51 percent of the crowd there's a certain cost
implied. So you can always just kind of [inaudible] up how big of a workforce. You can
also use known good workforces.
So if maybe I have a small internal employee pool and then I'm using the crowd to kind of
do the really heavy lifting, the filtering task can be done internally or by more trusted
workers, and not actually give access to those kinds of groups or to any service that could
be corrupted by those types of groups. That's one way.
>>: You showed that people are less willing to do a task if they think it's a bad thing like
the credit card numbers. Are they more willing to do a task if they think it's a good thing
like helping a disabled person?
>> Walter Lasecki: Right. So I did not add that slide. But we actually have initial data
kind of from survey results trying to look at if you give workers a set of tasks, it's all the
same type of task but there are just little bits of flavor added. So, "This task will help a blind
user" or, "This task will help a large corporation," and looking at the fact that there is
seemingly a preference difference. People will go for the assistive technology
application. And certainly in experience we've seen workers go out of their way to come
back and ask us, "You know, if you have another task, let me know." So yeah. There is
some bias there and we're now kind of working on experiments to show that.
>>: Coming back to your last slide about how [inaudible] workers are [inaudible] around
1960s. I think if you take the work that people do in jobs, it also changes [inaudible]. Like
people are more engaged. You kind of change from [inaudible] factory to maybe more
an intellectual job or something that keeps your mind occupied or keeps you engaged. I
don't know if there's a progression like that in crowdsourcing too where the tasks started
being these very small things for a single person who's not actually very intellectually
challenged by what they are doing. But hopefully we will give them tasks that will be
more interesting or engaging than being these small pieces. So I'm kind of wondering if
that'll have some kind of an influence too.
>> Walter Lasecki: So in my experience I think it will because if you look at some of the
feedback that we've gotten from workers in these continuous tasks where they can kind
of stay for longer periods of time. We're not constantly interrupting them. We're letting
them kind of invest what they think is necessary in the task. So if you want to go and
really do a lot of research to bring an answer back to Chorus, that's great. And we'll try to
reward you for that. But it's not necessary. Like, you could just give the basic answers.
You can opt in when you want to answer and when you don't want to answer, that kind
of thing. We give a lot more freedom. We give a lot more kind of providence over the
data, I guess. You're really not seeing the individual worker contribution, but the workers
themselves see that they get more points and things like that. So there are, I think, a lot
of different levels of engagement and a lot of different levels of commitment to a specific
task because you can stay for a longer span. So that hopefully is even supported by the
continuous model. But, I agree. Yeah?
>>: When you're using the crowd agent to train the hidden Markov model for the video,
did you find that to be a reliable method of training?
>> Walter Lasecki: Well, so what we were doing there is actually having the crowd
come up with label sets that were exactly the same as an expert would have. So they're
basically -- It's a correlated time range with a label. So because -- And I kind of
mentioned this in my answer to John. Because we can get consistency across what we
were calling the activities by taking a guess with the system and trying to like pull people
towards consistent answers, then we could kind of re-label making breakfast every
single time we saw a different type or different course of action that went into making
breakfast. We were still able to come up with the same label. So the system could get
trained. It didn't have like 400 different labels for a single activity. So then, at that point it
just kind of converges to what an expert would do right now. But we can do it online
which is [inaudible]. All right.
>> Ece Kamar: Thank you again.
[Applause]