>> Ece Kamar: It's my pleasure to welcome Andrew Mao back to Microsoft Research Redmond. He interned with us last year, and now he's an intern at the Microsoft Research New York lab.
Andrew is a fourth-year Ph.D. student at Harvard University working with Yiling Chen. He's interested in
applying empirical methods for understanding human behavior in online settings. I'm giving you
Andrew.
>> Andrew Mao: Thank you very much, Ece. Today I will be talking about a couple of projects stemming
from my internship last summer here and basically understanding better the connections between
algorithms and human behavior in crowdsourcing. I'll just start with a really simplified dramatization of
a crowdsourcing system. We have both paid and unpaid types of crowdsourcing. When people come to
the crowdsourcing system, we basically are in need of their time and their effort. We either
compensate them in financial ways such as payment or in other ways such as just common interest or
altruism or social behavior. There are some fundamental questions in this type of environment because
it really matters to us how much work the workers produce. We can't just pay them.
If we're not paying them, how do we keep them interested? How do we measure how much work
people are willing to produce, and how do we characterize how they interact with each other in the
system and outside of it? These are some fundamental questions, I think, for crowdsourcing that need
to be addressed. First of all, why do users give us their time? If we're paying them, how can we get
them to spend more time with us? If we're not paying them, how do we keep them interested and
measure how much work they're able to produce? If we're paying people, how exactly should we pay
them? I'll go into this in the second part of my talk, but it's not clear that the current systems for paying
workers in crowdsourcing are optimal. Finally, how do we design systems that can leverage the
interaction and interactivity between users, building crowdsourcing systems that are more social and
take advantage of the fact that humans can actually do more when operating collaboratively than
independently? The approach that is kind of guiding all the work I'm talking about today is that in order
to build better algorithms and systems for crowdsourcing, we shouldn't just be modeling human
behavior. There's been a lot of work kind of describing how or making assumptions about how humans
in crowdsourcing systems work and how we can model their behavior, but we can actually understand that behavior and use it to make adaptive algorithms that take advantage of different types of
behavior. I'll be going over several examples of that today. Let me just give you an overview of my talk.
I'll cover two of the projects I worked on last summer, starting with an empirical evaluation of different types of
engagement in a volunteer crowdsourcing setting. I'll also show you another project where I look at
different types of payments and how workers are biased and how this compares to volunteers in a
crowdsourcing task. Then finally, if we have time, I'll go over different future directions in studying
interactive crowdsourcing between different groups of workers. Let me start with an example of why
it's useful to study engagement in crowdsourcing. Galaxy Zoo is basically a project that started over four
years ago -- actually five years ago, in 2007. The idea is that in the Sloan Digital Sky Survey, a telescope
is pointed at a lot of different stars and has taken pictures of many of the galaxies in the universe. The
idea was to take a lot of human workers and basically get a better idea of what type of galaxies exist in
our universe and get a better understanding of the stars. This is about the simplest type of
crowdsourcing task you can think of. In its first iteration, workers basically classified the galaxies into
one of six types. They basically clicked one of these buttons here, and basically were repeatedly shown
pictures of different galaxies. The idea is that once we have a majority vote on the classification of a
galaxy, we have a consensus. Surprisingly, in spite of this task being really simple, people have spent a
marathon amount of time on this task. The longest amount of time that someone has done this task without leaving for more than 30 minutes was 17 hours, and they went through almost 12,000
different galaxies. I'm not sure who that person was, but they are the type of crowd worker that we like
to see. This leads to some interesting questions though, because we want to understand why people
are willing to work for this long and why are they willing to give us their time. One idea to look at this is
the idea of engagement or how productive people are going to be. How much time are they willing to
spend? How much work will they produce? What do the activities of engaged workers look like? Can
we distinguish between different types of workers in this system, and if we see workers in an online
setting, how much longer are they able to -- will they continue to provide us information? How can we
predict if they are actually going to quit or stop? This is just one definition of engagement which I'll
cover today, but there's definitely, I think, more room for nuanced ways to take into account
worker interest such as if they'll return in the future and things like that. Why do we care about this
type of engagement? Well, the idea is that if we can model the features that allow us to predict when people are engaged, we can actually characterize what is driving participation. We can look at patterns and recognize when people are actually willing to work and when they aren't. We can estimate
the amount of work that our system has produced. We can actually take a look at the workers that are
interacting with our system and basically be able to estimate how much work we're doing. Are we
keeping people interested or not? If we're not keeping them interested, how do we improve so that we
can get more work done? More importantly, we can actually take targeted interventions or different types of interactions to try to engage those workers, such as new workers to the site that we'd like to convert into old hands or dedicated users who will basically provide us a lot more work over the long run. Obviously, this type of problem is a lot more important for unpaid
systems, because we don't have the luxury of just giving people money in order to get them to stay. I
think this problem of taking an algorithmic approach to engagement is a natural extension of work that's
been done so far with applications to crowdsourcing. Ece and other collaborators at Microsoft Research
have basically taken a look at decision-theoretic approaches for predicting how accurate a task is going
to be and making decisions for how many workers to hire in order to achieve a desired level of accuracy.
There have also been other decision-theoretic models for making optimal decisions in other types of
crowdsourcing for accuracy, but this has kind of not been studied as much in the context of engagement
and interest. There have been some types of attention models for different types of online
communities, and it's natural to take this decision-theoretic or algorithmic approach in order to build a
foundation for making these types of decisions. Let me just give a basic overview of the model that we'll
be using to study this problem. We have different users in our system and they can basically work on
tasks -- yes, go ahead.
>>: Maybe my question is a little bit ignorant, but on the related work, I assume that they had some
kind of models of users, crowd users, when they have this decision-theoretic model, right, otherwise
how can they determine how many workers they can hire?
>> Andrew Mao: Yeah, exactly.
>>: Are you saying that those models are not adequately capturing the real people's engagement?
>> Andrew Mao: Those models are primarily used in order to make predictions about the accuracy of
tasks. I'm saying that we can, we should build similar models in order to take into account the
engagement of people working in crowdsourcing. Does that make sense?
>>: Those models assume that once people are assigned a task then they will finish it?
>> Andrew Mao: Yeah. Then you also make some assumptions about the distribution of workers that
you have. You don't take into account things as how long people are working and how many times
they've come back and things like that. I think these models together in the future will naturally
integrate into a much larger system.
>> Ece Kamar: May I just add one thing? Those models are only reasoning about your immediate next
task which will probably take some time from five seconds to a minute, but they don't really reason
about your longer-term contribution to the crowdsourcing system in the next half an hour, an hour.
That's where we are trying to get.
>>: So those are kind of short tasks then?
>> Ece Kamar: These are called micro tasks and they take very, very little amount of time to finish.
>>: All right. Thank you.
>> Andrew Mao: By the way, feel free to interrupt if you have any questions at any time. We'll basically
take a look at a type of crowdsourcing given the tasks that I described to you, such as simple
classification tasks, where there are different workers and over time, they complete different groups of
these tasks. They can take different amounts of time on each task, and they might take breaks where
they're not doing anything at all. They can either take short breaks or longer breaks after which they'll come
back in the future. We'll group these sets of tasks into different types of sessions. What I've shown you
here is a contiguous session, and the idea of this is to capture groups of tasks that people do in a
continuous amount of time on the site. We can further take these sessions and group them into
aggregate sessions, and the idea of this is to capture amounts of time that people spend including short
breaks but without completely leaving the site. This is like bathroom breaks, pizza deliverymen, and
things like that. At some point in time, we basically have the problem of we know everything about the
worker so far, including the task they're doing and everything in the past, and we want to make a
prediction about when these workers will finish their sessions. When will they stop working? This is the
type of problem that I'll be looking at today. There are different types of hypotheses under this
model, right? Naturally, as people work longer, we're going to think that they're more likely to quit,
right? At the same time, we can do more nuanced measurements of what people are doing, so if we see
that people are spending more and more time on each task, they might be getting less interested
because they're distracted, or they may also be becoming more careful and learning how to do the task
better. If we see extremely short amounts of time on each task and people basically making the same
decision all the time, we might believe that they're not really paying attention. We can also say that
perhaps people might work at the same amount of time each day, so if we know what time they're
coming, we can make predictions about when we think they'll stop. Importantly, all this information is
probably more powerful when we take into account past performance and what workers have done
before, both for the same worker and for the group as a whole. Let's take a principled approach to
solving this problem, so I'll start with some really interesting aggregate data that we've gathered from
Galaxy Zoo. What I have here is a graph of the cumulative number of sessions that return after some
amount of time. So given that someone has stopped working, for example, 40 percent -- almost 40
percent of people come back after 30 minutes. You can see that these are different things. 10, 15, 30
minutes. This is a log scale on the x axis. Particularly, you see some interesting patterns here in that at
one day and two days, there's a noticeable spike in this graph. There's kind of a much denser group of
people, sessions that are restarting after 24 hours and 48 hours. If we look at what this looks like for
just a two-day period, you can see that there's a distinct pattern of workers that come back after one
day. This is actually a really interesting type of distribution that we're seeing here, because after over
one day, we kind of have an exponential drop-off in the number of sessions that return, but we also
have this bathtub distribution that occurs over multiple-day periods. This is kind of saying that there's
really promising ways to model this and basically get a better handle on what people are actually doing.
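(For illustration, a minimal sketch of the session grouping described above, with placeholder timestamps and the 5-minute and 30-minute gap thresholds standing in for whatever a real system would choose; this is not the actual Galaxy Zoo pipeline.)

```python
from datetime import datetime, timedelta

CONTIGUOUS_GAP = timedelta(minutes=5)   # a short break ends a contiguous session
AGGREGATE_GAP = timedelta(minutes=30)   # a long break ends an aggregate session

def group_sessions(timestamps, max_gap):
    """Split a worker's sorted task timestamps into sessions whenever the gap
    between consecutive tasks exceeds max_gap."""
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] > max_gap:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Example: one worker's task completion times (made up).
times = [datetime(2012, 7, 1, 12, 0) + timedelta(minutes=m) for m in (0, 1, 3, 12, 14, 80)]
contiguous = group_sessions(times, CONTIGUOUS_GAP)  # -> 3 contiguous sessions
aggregate = group_sessions(times, AGGREGATE_GAP)    # -> 2 aggregate sessions
```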
We'll basically take a machine learning approach to this problem, so every time someone completes a
task, what we'll do is create a training instance based on their history and everything we've seen so far.
For each of these tasks that someone completed, we'll ask the question, when will this person stop? We'll use this to make a prediction about the future and compare it to how this person actually performs.
The idea is that we'll have a temporally consistent approach here, so we'll use some amount of data as a
training set and then use the data that is further in the future for validation and testing. This is important
because we don't want to contaminate this data with training instances that came after it. We've
generated a pretty large number of features for this problem, and I'll just go over some of them here
and why they might be important. First, we can just take a look at what people are actually doing in
each session, how many tasks they've done, how much time they've been spending. What is the
average amount of time they've been spending on each task? What is the entropy of the work that
they're doing? That might be a measure of if they're making the same decision each time or if they're
paying attention. We can look at the tasks they're actually working on. How much do they agree with
previous workers that have been doing this task? How much have their votes been changing? Finally, we
can look at their history. How much have they worked for us before? How long do they typically work?
What time are they coming on the site versus when they usually come? And look at how long it's been
since they last worked for us, so notice that all these features that we're focusing on are not domain
dependent. They kind of work for all general types of classification tasks. Galaxy Zoo just turns out to
be a great set of data to study this problem, but the idea is that we're motivating this type of study for
more general types of crowdsourcing where you can generate these features from data. It's important
to define what exactly disengagement is. We have a prediction problem, and what's the label that we're trying to predict? Well, we'll focus on the binary prediction problem of
if someone's session will end within a certain number of tasks or a certain amount of time. There are
still multiple ways to do this, so we can predict whether they will end for a short break or a longer break,
and we've kind of used these definitions of five minutes and thirty minutes based on the distributions I
showed you earlier as a proxy for these types of short or longer breaks. We can also make predictions
by looking at how many tasks or how much time workers will actually spend on the task. We can
basically -- yes, go ahead.
>>: For those threshold minutes to design contiguous sessions and aggregate sessions, have you looked
at any other tasks than the galaxy that you collected traces from to confirm that these will hold for other
types of tasks?
>> Andrew Mao: No. I don't have another type of task which I can show. We kind of picked these, I
guess, looking at the distribution of work on Galaxy Zoo. I think what I'm trying to show today is what
can we predict from this task and how it might generalize to other tasks, so I don't have a clear answer
to that question. What I'm showing you here is basically the accuracy of this prediction as measured by
AUC or area under the ROC curve. The idea of AUC is how well your predictor is able to tell a random negative and positive instance apart. If someone will leave within 20 tasks, what percent of the time will we get that comparison right? There's a lot
of graphs here, but let me try and explain what's going on. On the left, we have prediction by the end of
a contiguous session and on the right, prediction by the end of an aggregate session. We can see that
the ability to tell disengagement from continuing to work is actually easier as we look at aggregate
sessions because basically this is kind of a more meaningful measure of when someone will actually
leave. Additionally, as we increase the time interval for prediction, if someone will leave within 30
seconds, one minute, two minutes, five minutes, the accuracy also increases. The ability to make this
prediction increases, but at the same time, if we predict that someone is going to leave within 30 minutes
or in a more extreme example, one hour or two hours, that's not a very useful prediction to make
because it doesn't allow us to take any actions as a result. So what we've basically done is chosen a label where we look at whether the aggregate session will end within 20 tasks, because this label actually has the highest correlation with all the other labels, and it's a meaningful tradeoff between the usefulness of the prediction and the accuracy.
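(A sketch of what such a prediction pipeline could look like: one training instance per completed task, a temporally consistent split, and AUC as the metric. The feature names, the label column, the input file, and the choice of gradient-boosted trees are all illustrative assumptions, not the actual system.)

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# One row per completed task: features computed from everything seen so far,
# plus a label saying whether the aggregate session ended within 20 more tasks.
df = pd.read_csv("training_instances.csv", parse_dates=["timestamp"])  # hypothetical file
feature_cols = ["tasks_this_session", "session_minutes", "mean_task_seconds",
                "vote_entropy", "agreement_with_others", "num_past_sessions",
                "median_past_session_tasks", "hours_since_last_session"]

# Temporally consistent split: train on the earlier 80% of instances,
# evaluate on the later 20%, so no future data leaks into training.
df = df.sort_values("timestamp")
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

model = GradientBoostingClassifier()
model.fit(train[feature_cols], train["ends_within_20_tasks"])

probs = model.predict_proba(test[feature_cols])[:, 1]
print("AUC:", roc_auc_score(test["ends_within_20_tasks"], probs))
```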
I can give you an idea of what these types of models look like when we compare them to each other. On this axis, we're predicting the probability that the session ends in five
tasks, and this axis is the probability that it ends in 50 tasks. The blue squares are the training instances
which actually ended within 50 tasks. As you can see, these models are actually predicting pretty
correlated values. They're not exactly correlated, but they're very similar if you kind of transform this function into the probability that people will actually leave their session. A natural question to ask is given
that we are making this prediction for the entire site as a whole, how do the different sets of features
matter? I described to you different types of features such as the tasks features, the session features,
and the user features. As we add more features, including the user history and the session features, the
accuracy of our prediction increases, but also, we might be concerned about different subpopulations
on the site. If we take a look at this model and apply it to segments of the population, such as only
those users with a low amount of history or the users with a high amount of history, we see that it
actually makes better predictions given more history, and of course, that makes sense. Additionally,
when we take into account the user features for people that don't have much history, the improvement is much less than when we have more history about the users. Another question we'd like to ask, though, given what we see here, is does it make sense to train
models for specific subpopulations on the site? For example, if we're interested in targeting only users
with a low amount of history, can we actually get better performance if we train a model just for those
users? The answer that we've seen from our data is that the general model is rich enough to capture predictions about the entire site of users as a whole. What
we see here is that if we train a model that is specific for users with lots of history, it actually doesn't do
any better than the general model we've trained on the entire population. This is actually a similar
result for users with low amounts of history. This is particularly interesting because usually we want to
be able to have a rich enough model to capture information about users on the entire site, and this is
saying that if, for example, we want to make predictions just about the users with little history, because
they're kind of new and we want to convert them to more dedicated users, we can use a more general
model. Yeah, go ahead.
>>: [indiscernible]
>> Andrew Mao: What I've basically taken here is like the top and bottom quartile of users by how
much work they've done in the past.
>>: How does the situation look like if they're like new people, like they have rich history?
>> Andrew Mao: These are basically 25 percent of the users, the top 25 percent and the bottom 25
percent of users and basically all the instances that were generated by those users. One of the motivations of this model is to be able to use it in an actual crowdsourcing system.
That means if we start something and we basically start asking people to do work, how much data do we
need to achieve this level of accuracy about the prediction? One way to test this is under the temporal
model I showed you before. You basically train with increasing amounts of data and see how long it
takes for it to converge to the optimal level of accuracy. If we take a look at different amounts of
training, such as one day, two days, three days, a week, two weeks, and a month, we can see that
there's actually a lot of data here, so much that it actually takes about a month of different data before
we reach the same amount of performance as in the general model. This is saying that basically if you're
starting a crowdsourcing site fresh with users that you've never seen before, you should keep retraining your model, in this case for about a month, until you reach the same level of performance. Another question we might want to ask is how well does this model generalize over time? Given the amount of
data that we have, how much variation is there in how well we can predict the engagement of users? In
this case, what we'll do is take different amounts of training data and test over different segments of the
future to look at how well the model makes predictions. What we see here is that the performance of
the model actually does not change that much over time. It's slightly dependent on the set of data that
is generated by the users, but there is no clear upward or downward trend; it's not getting more or less
accurate. Given that we've kind of reached convergence of the performance in this model, it's not really
necessary to be constantly retraining it in the future. What are some of the most important features in
this model for making these predictions? As you would expect naturally, looking at the number of tasks
that people have done in their previous sessions and also in their more recent sessions for making those
comparisons is important, but also looking at the amount of work they've done in their current session
and comparing that to the median of the work that they've done in the past allows us to make these
predictions about if someone will leave or not. Also, how long they've been a user on the site is an important feature, and it kind of allows our learner to distinguish between different groups of users on
the site as you saw in the previous slides. What I've shown you here today in this project is basically an
attempt to measure engagement in a real system. What we learned is that using non-domain-specific
features, we have a meaningful statistic for targeting users that are about to leave. So if we look at the
top five percent of our predictions by probability, 54 percent of the time, those users will actually leave
their session. This model is rich enough to capture different subpopulations on the site, and I've shown
you basically which different groups of features are important for predictions. I'm not making a general
claim here, but more that it's important to study the different types of features, such as in the session or
the user history or the task itself, because in different domains those might have different importances.
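(A small sketch of how predictions like these could drive targeted interventions, flagging the top few percent of instances by predicted probability of leaving, as mentioned above; the model and the intervention hook are hypothetical.)

```python
import numpy as np

def pick_intervention_targets(worker_ids, leave_probs, top_fraction=0.05):
    """Return the worker ids whose predicted probability of leaving is in the
    top fraction of the current batch of predictions."""
    leave_probs = np.asarray(leave_probs)
    cutoff = np.quantile(leave_probs, 1.0 - top_fraction)
    return [w for w, p in zip(worker_ids, leave_probs) if p >= cutoff]

# e.g., with the model from the earlier sketch:
# targets = pick_intervention_targets(worker_ids, model.predict_proba(X)[:, 1])
# for w in targets:
#     show_tutorial(w)  # hypothetical intervention hook
```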
I've also shown the amount of data you need to converge to optimality and how well the model
generalizes to future time periods. The future work that we're particularly interested in for this task is
to basically take this model of engagement and test interventions out on the users. This means doing
things like making the users more interested, getting them to come back in the future, and basically
converting the workers that are just visiting us into more dedicated users, because that will increase the
amount of work that we can ultimately get done as an unpaid crowdsourcing system. Some of the ideas
here are, for example, showing a tutorial for the task for people that you think might be confused and
disinterested, showing them particularly interesting tasks that will keep them interested. There are
other things, so if we think that a user is doing a very bad job and we want them to leave, we might be
able to increase the probability of that happening by kind of forcing them to do something else,
or even increasing the cognitive load that it takes them to do this task. These might be a little bit
dystopian, but it could be a way that an algorithm can optimize the amount of work that's done in a
crowdsourcing system. I'm going to talk about another project next. Does anyone have any questions
so far?
>>: Have you heard of Fold It?
>> Andrew Mao: Yes.
>>: They're using a game basically to crowdsource. Do you know about any kind of stats that they have
released, like the engagement of people? If you have a game, does it change anything of your findings?
>> Andrew Mao: I can talk a little bit about that later, but this is, I guess, just based
on the Galaxy Zoo which is trying to study a classification task. I think that gamification and different
types of interaction could be really helpful for increasing engagement in crowdsourcing, and that's one
area I definitely want to study. The question I want to ask next is what is the right way to pay workers in
crowdsourcing? If you're familiar with Amazon Mechanical Turk, you'll know that there's basically a
canonical way of paying workers for each task that they do, and we basically accept or reject the work.
Is this the only way to compensate workers for their work? I'm arguing that this is definitely not the
only option. The particular payment system that we're adopting here, paying workers for one task at a
time, how does this affect them? It’s important to study this question and look at other ways you may
be able to compensate workers in paid crowdsourcing. Another important question is how do the
workers that we're paying compare to workers that aren't paid at all; that is, those in volunteer settings?
I am going to attempt to give a partial answer to that question today with a really interesting study
we've done using a crowdsourcing task. I'll adopt a similar model to what I showed you before, but we'll
basically study the context of one worker. What we basically have here is a needle in a haystack task.
Workers are looking at these tasks over the course of a session, and each task is basically a haystack
with some needles we're interested in finding. Some things look like needles but aren't needles. We
basically have some things that we want workers to find and some things that may be misleading and
we actually want people to ignore. What we want to do is incentivize people to give us good accuracy
on this task, to limit the false negative and false positive rate, but we need a realistic task to study this
problem, and we need ground truth data in order to look at what the actual accuracy of workers is.
What's the right way to pay people for this type of task? I'm going to focus in particular today on
methods that aren't performance dependent. We want to study what type of biases are created by
different types of payment when we don't actually have a ground truth or we aren't giving -- we aren't
using gold standard data to evaluate the performance of workers. So in addition to the typical way of
paying people for each task, there are other ways: we might pay them standard wages, as in normal economics, or, in this type of needle-in-the-haystack task, we can pay for each annotation that people make, so each time they mark something we make a payment. Or they might not be
paid at all. So if we have a similar type of task in a volunteer system, what type of accuracy and
characteristics can we expect from the workers there? Basically if we're paying people in these different
ways, the theory in economics predicts that people will respond accordingly to the incentives in the
payment. Even if they're not completely economically rational, there are still going to be effects based
on how they're getting paid. In these payment methods I'm showing you here, we'd like to study what different effects they cause on workers. The prior work in this literature is pretty interesting.
People have studied not how different payment methods affect people but the amount of payment, and
the consensus from several studies is that the amount of pay actually doesn't really affect the quality of
the work. It can affect the quantity of the work done and the speed of the work done, but in general the quality doesn't change very much. Additionally, there's kind of an
anchoring effect for the level of payment, so if you're paying workers a given amount, they always think
they deserve a little bit more. There's also been a lot of work looking at different types of incentives for
getting people to give good quality work, and a lot of those focus on performance-dependent pay. So if
you have some gold-standard data, and you're compensating workers according to that, how much does
their behavior change? Unlike this prior work, we're actually looking at how we can incentivize people and the biases they get when the pay is not performance dependent. A lot of this is, I guess,
based on work in economics about how people respond to different types of incentives. The task that
we're going to basically study here is finding planets in other solar systems using the transit method.
There are these telescopes in orbit above Earth that are basically pointed at a bunch of different stars,
and they measure the brightness of the stars over time. The idea is we can actually find planets in
distant solar systems by looking at the brightness of the star. In here, you see a planet in orbit around
the star, and when it passes in front of the star and it's in the same plane as the telescope that is viewing
it, it causes the brightness of the star to dip. Basically by looking at these dips, we can find planets in
other solar systems. You can imagine that there are some nuances to this task. There's actually a
Zooniverse project called Planet Hunters where workers are asked to find these planets, possible
planets, by looking at what is called a light curve, which I just showed you, which is the measure of the
brightness of a star over time. This is kind of a pretty general interface for looking for these objects
where you can zoom in, you can scroll around, you can draw boxes in to mark different transits. What I
have here, this is a particular example of what might be a planet transit in this data. The idea of planet
transit is particularly interesting because you can actually simulate them with mathematical functions.
We can basically take some data that we've collected that has no planets, because basically the vast
majority of the stars don't have planets around them when we take this light curve data, and we can add a
fake planet to this light curve. So we just imagine that there's a planet rotating around this star with
some orbit and distance, some period, and some size, and we can subtract this from the brightness of
the star and basically get some fake planets. This creates interesting ground truth data that's very
similar to the real data for this task and also used by astronomers. You can imagine that, with the ability
to do this, we can actually vary the difficulty of the task. If we take a really big planet and put it in front
of a small star, we get what is a really easy-to-detect transit because basically compared to the variance
in the actual brightness measured by the telescope, this is a really big dip to spot. But we can also get
really hard transits as well. We take a small planet, put it in front of a big star, the brightness changes a
lot less, and especially if that planet is going very fast, the width of the dip gets much smaller. This
actually gives us a great test bed in looking at how financial incentives affect workers when they're
working on tasks at varying levels of difficulty.
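(A toy sketch of injecting a synthetic transit into a planet-free light curve: a simple box-shaped dip with made-up parameters stands in for the proper transit model the astronomers use.)

```python
import numpy as np

def inject_box_transit(time, flux, t0, duration, depth):
    """Subtract a box-shaped dip of the given relative depth from the light
    curve wherever |time - t0| < duration / 2."""
    flux = flux.copy()
    in_transit = np.abs(time - t0) < duration / 2.0
    flux[in_transit] -= depth
    return flux

time = np.linspace(0, 30, 3000)                  # 30 days of observations (made up)
flux = 1.0 + 0.001 * np.random.randn(time.size)  # noisy, planet-free star
easy = inject_box_transit(time, flux, t0=15.0, duration=0.3, depth=0.01)    # big dip: easy to spot
hard = inject_box_transit(time, flux, t0=15.0, duration=0.1, depth=0.0005)  # small, fast planet: hard
```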
The experiment we basically set up is kind of like a replica of Planet Hunters using Amazon Mechanical Turk. We'll give people a group of light curves, and they
can basically draw different boxes wherever they think they see transits in this light curve. There's this
counter at the top of their screen that shows how much they're getting paid, and it updates in real time. Basically as they work through the task, under whatever incentive method they've been assigned, we'll show their payment and update it constantly. They can basically draw any number of transits
here subject to some limits which I'll discuss, and then they can either click next to continue to another
task or submit their work and basically fill out an exit survey. The point of this experiment is to basically
compare different incentive methods at the same average amount of payment. So if we're actually
paying people the same amount of money but in different ways, how do they respond to that? We can
vary the level of difficulty in this task, and we actually have the same tasks that were done by volunteers
in the unpaid Planet Hunters crowdsourcing system. Additionally, when people finish their work in a
series of light curves here, we can ask them why did they stop. What made you choose to stop doing
our task and possibly go to something else? The design of this experiment actually comes in three
phases. First, we looked at how the volunteers behave in the unpaid setting. We see that they work at
some certain rate, in tasks per minute, and for all the tasks that we're looking at, they make some number of annotations, which is the transits that they've marked. Then, assuming that
people in a paid crowdsourcing system, Amazon Mechanical Turk, behave the same way, we calculate
the wages that we'll pay them such that they'll make $4.80 per hour. The reason that we picked this
number, $4.80, is if you're familiar with the Mechanical Turk ecosystem, workers try to target a wage of
$6 per hour, and that's kind of an acceptable rate at which they'll work an entire day on a given task.
We basically want to set a wage that's slightly lower than that so at some point they'll be compelled to
leave, whether due to disinterest or whatever other reason.
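(A sketch of that wage calibration: back out per-task and per-annotation prices from the volunteers' observed rates so that unchanged behavior would earn the target hourly wage. The volunteer rates below are placeholders, not the numbers from the study.)

```python
TARGET_HOURLY_WAGE = 4.80

# Assumed volunteer rates (placeholders, not the actual numbers from the study).
volunteer_tasks_per_hour = 60.0
volunteer_annotations_per_hour = 90.0

pay_per_task = TARGET_HOURLY_WAGE / volunteer_tasks_per_hour
pay_per_annotation = TARGET_HOURLY_WAGE / volunteer_annotations_per_hour
pay_per_hour = TARGET_HOURLY_WAGE  # the wage treatment is simply paid by time

print(f"per task: ${pay_per_task:.3f}, per annotation: ${pay_per_annotation:.3f}")
```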
Given that we've calculated these wages, people will all earn $4.80 per hour in all the different treatments as long as their behavior doesn't change, but of course, we expect that it is going to change, right? These financial incentives will definitely
affect them. So what we'll do is actually scale the payments that we're making so that the workers are
actually earning around the same amount of money. And the reason that we can do this, as I told you
before, is because the prior work on financial incentives is that the behavior in terms of quality and
some other features doesn't actually change that much based on the amount of payment, so we kind of
account for the biases in here. We can just scale everything accordingly and people will get paid the
same amount. What are some hypotheses of the biases that workers might produce when we compare
this to the unpaid setting? As predicted by economic theory, we're looking at how accurately they're
able to detect transits in these different light curves, planet transits. This is a kind of an ambiguous task,
right, because there's a lot of noise in the light curve and some planets are big and some planets are
small. Even if you're not explicitly cheating at the task, you might actually think that you found
something and kind of be more lax in your standards for your work. If we kind of think about these
incentives, if we're paying everyone by each task that they do as we normally do on Mechanical Turk, we
might think that the recall is going to go down because they're going to be less careful about looking for
things. They want to finish as many tasks as possible so they won't look as carefully for transits, so the
amount of tasks will go up, and they'll spend less time on each task. If we're paying them for each
transit they actually marked without regard to whether it's right or not, we can think that the precision
is going to go down, because of all the transits that are marked, fewer of them are actually going to be
the needles that we're interested in finding. The recall might go up though, because they're basically
going to mark as many things as possible, and they might also do an increasing number of tasks in order
to make more money. Finally, if we're paying them just by wages, it's not clear what type of bias we'll
see here, but they may just spend a really long amount of time in the session and maybe even work
really slowly because their work is not really dependent on how fast they're producing results. Because
all these incentives actually give economically rational agents some pathological behavior, we've put
some pretty simple but very generous controls on the experiment. The idea of these controls is to kind
of limit the type of crazy behavior we would see from people that are trying to make as much money as
possible. Because of these different treatments we're running, we've put these controls on every
experiment: people had to spend at least five seconds looking at each light curve, they could only draw
a maximum of eight boxes on any given light curve, and if they didn't do anything for two minutes,
based on detecting clicks and different events in the browser, we would actually just give them a
warning, and then at three minutes kick them out. These controls were over the entire experiment, but
they were mainly geared toward the type of pathological behavior that you might see in these
treatments. They're designed to be generous enough so that only the most pathological workers would
actually hit these limits, so we selected our data such that you wouldn't actually be able to hit any of these
limits, particularly in the number of annotations, by working on the task. Then we also limited the
overall amount of time in the task to an hour or 200 light curves, which is basically more than the
amount of time that we need to study. Let me just show you what happened when we ran the first
phase of this experiment. Assuming that everyone was behaving the same way, they should have all
earned the same amount of money, so of course, the people that were being paid by time earned $4.80
an hour. Does anyone want to predict how much the people that were being paid by task were earning?
How much did their behavior change to increase the amount of time they were working? Any
suggestions, in terms of like an aggregate average hourly wage for the workers in that treatment?
>>: Was there any additional friction to having to go get new tasks?
>> Andrew Mao: No. There was no -- what do you mean? No, there wasn't. All the tasks were in one Mechanical Turk HIT. Basically we used that as the -- sorry, the conduit for our experiment. There was
basically no friction. You just click the next button and you get another one, so you can click that as
much as you want every five seconds.
>>: Are you asking for the mean or the median, because presumably they'd be different?
>> Andrew Mao: This is kind of like the weighted mean. How much did we pay people for how much
work they did divided by the total amount of time people spent?
>>: That would be the mean.
>> Andrew Mao: Yeah. The amount of money people were making when they were getting paid per
task was over $8 an hour.
>>: Presumably, a few outliers are going to really shift the mean.
>> Andrew Mao: This is basically weighted by the amount of time that people spent, so
the workers that basically spent only a small amount of time don't count much toward this. It's basically
the people that did the majority of the work here that we're looking at how much money they made.
Does that make sense?
>>: Still right, a few outliers are really going to mess this up, so the median is going to be.
>> Andrew Mao: I think you kind of have to weigh it by how much people did, right, because if you just
take a look at the raw numbers, people kind of worked on the task for different amounts of time, and
they're not -- you can't take the median of that, I think. It's weighted by how much work the people did.
So how much in aggregate did we pay people who are working in this method?
>>: As far as knowing whether this would hold up to any kind of statistical significance test, this could be
two workers who are the result of this shift, right?
>> Andrew Mao: We limited the amount of time for each session, so no worker is going to account for a
huge amount of this, no particular worker.
>>: What statistical test would you be running on this?
>> Andrew Mao: We actually didn't run a statistical test on this.
>>: We don't know if there's a real difference here.
>> Andrew Mao: I guess, but I'll show you some other results, and maybe -- I mean that's not the point
of what I was trying to show here, but I can -- let me go forward a little bit and see if I can answer your
question.
>>: I'm interested in your paying by time, and then you weight by the number of tasks. Like the 4.8, is
this just the average, just the average of all these people?
>> Andrew Mao: We basically take the amount of money we paid to workers in this treatment divided
by the amount of time they spent on this treatment, which is basically weighted. We are paying them
this wage, right, but for other people who are behaving differently. If you wager a guess at the amount
of money we're paying when people are actually working by annotation, you can see a clear bias here.
They are basically making almost $11 an hour, and some workers, like you said, the outliers, one person
actually made over $20 an hour doing our task on Mechanical Turk. That might be around the highest
amount of money anyone has made as an hourly wage on Mechanical Turk. What we did in the second
phase of our experiment is basically scale down all these wages for the other treatments so that people
were getting paid the same amount. We basically took the wage that we were paying people and
divided it so that everything would be around $4.80. Assuming that these are all -- all the workers that
we use for each treatment were unique, so assuming that we're kind of drawing from the same
distribution with a large enough amount of data, these wages should be similar, right? What we
basically see is that yes, the wages, once we've scaled them, we were actually able to get them to earn
within a dollar of each other, and in particular, this number is extremely close. The number of workers I
actually have in here is around almost 200 per treatment, and because we've kind of limited the amount
of time they're able to work in each session, I don't think that the outliers are contributing a lot to each
of these payment levels. We wanted to get some treatments where people were actually earning the
same amount of money so we can kind of make an apples-to-apples comparison of how accurate the
work they're doing is.
>>: What's the N for each number? It's like over 1,000 or?
>> Andrew Mao: Basically the amount of workers in each of these is, sorry, 180, and the number of light
curves is about 25,000. Sorry, I should have put those numbers in here. Let's take a look at what
happens to the precision, or basically the percent of light curves -- sorry, the percent of
transits marked that are actually correct in each of these treatments. So we're taking the measure of
precision as if you marked a box, the center of the box is actually in the transit. This is the same thing
that the scientists in Planet Hunters used to evaluate their correctness, so we'll just use the same
measure as them. What we see is that when we pay people by wages. These differences I show you
here, using a parity test, are very significant, less than 10 to the -4. When we pay people by time, their
precision actually increases a lot compared to the people that are being paid by annotation. This is
something that we'd expect because the people that are being paid by annotation are trying to mark as
many things as possible. Surprisingly, we also see that the amount of the precision goes down as the
difficulty increases much faster for the task treatment than relative to the other treatments. Another
interesting thing to note here is that the precision does not decrease as quickly for the volunteers, and I
think there's two interesting reasons for why that might be possible. One is that we're naturally paying
people on this task, so given that this is the only treatment where people are not being paid and all the
other ones are being paid, the fact that we're paying people makes people like, for example, workers on
Mechanical Turk, value their time more. They're probably going to be a little less careful on basically
choosing what is right as things get more difficult. But also on the volunteer crowdsourcing site Planet
Hunters, workers were a lot more heterogeneous. What we basically took was fresh workers, gave
them a tutorial of this task, explained basically what they had to do, and they all had the same amount
of experience in this. What's different in the volunteer system is that the workers might vary a lot in skill
level, so given that you've marked something that was very hard to spot, it's more likely you'd be correct
relative to these other treatments. If we look at the recall at different levels of difficulty, there's kind of
a general trend here that you see. As things get harder, we actually get a worse recall. But the
differences here are significant if you kind of take out the overall downward trend, so the wages are
actually doing better than the per task treatment at all levels of difficulty by recall. We also see that at
really high levels of difficulty the paying by annotation has the highest recall of all the paid treatments.
In general, if we're doing tasks that we're paying people by -- sorry, if we're doing work that we're
paying people by task, as on Amazon Mechanical Turk, you may not want to pay people for each task if
you want the best recall possible because as the task gets harder, you kind of have very low levels of
accuracy in response to how much you’re paying people, because remember, in all these paid
treatments, the mean weighted average wage is actually the same. So what's one reason you might actually
want to pay people per task? Well, if we look at the characteristics of the sessions and the behavior of
workers, the number of tasks they work on in a session is statistically significantly higher than when we're paying by annotation, and they actually are the fastest; they actually work on the tasks more quickly when we're paying them by task. So if you have an algorithm that is trying to recruit workers for a certain
type of task, and you need your results fast, and you are less concerned about accuracy, then paying by
task is actually a good way to get those results done faster because it increases the throughput basically
at the expense of the types of accuracy I showed you before. At the same time, if you're willing to pay
people wages, earning the same hourly wage, they're willing to spend more time working on your task
and more time per task. So I think this is one reason why we saw the high levels of accuracy when we
were paying people wages, but the actual throughput of tasks is slower. Anyone have any questions so
far? The final thing that we studied is why did people finish working on their task. Given the comments
that they provided as they finished the task, we basically looked at the reasons they stated for actually
finishing. I think the one thing that was particularly surprising to me as I categorized these into these
general categories is that workers on Mechanical Turk are actually concerned about their evaluation for
the work that they do. A lot of workers actually stopped working because they thought that they might
be rejected or they were doing low-quality work. There are a couple reasons for this. Some people
thought that they were helping a science project and may have not wanted to provide bad quality
results, but a lot of workers were actually just worried that they were getting monitored somehow,
maybe because of the controls we put in, and that their work may be rejected. We saw a lot of people
actually say I don't know what I was doing; I'm actually going to stop now. We also saw some people
that basically had limits on the amount of time they can work on Mechanical Turk, and
there's basically no way that we can increase their engagement. For example, there seemed to be some
turkers that actually do this during their lunch hour at their real jobs. So this person said my lunch
break is over; I have to go back to work; I can't work on this task anymore. We also see a lot of people
that were kind of interrupted by much shorter types of things such as I had to go to the bathroom and
then your task timed out. Then one particularly interesting anecdote is that the way we pay people may
actually keep people interested in different ways. This is one person that was actually getting paid an
hourly wage and his money was kind of going up as he was doing it, but he thought that he would have
been more interested if he was paid for each task that he did, so he -- these workers didn't know about
the other treatment when they were in the task. It was kind of curious how this worker particularly
thought that he wanted to be paid in a different way. One other point I wanted to make as a result of
running this study, which some of you may or may not be aware of, is the metabehavior of workers in
paid crowdsourcing systems. When we look at how people behave on Amazon Mechanical Turk, there's
clear evidence, both from our studies -- in that people ended tasks because they're not sure of the quality of their work -- and also from other observations on forums, that workers will actually test your task
before committing to do a really large amount of work for it. They basically are gauging the amount of
effort they put in with how much they think you'll monitor them and correct their work. This might
seem like workers are smarter than what we want to deal with, but it actually says that if we design the
right type of controls for our tasks, we can actually use this type of self-policing behavior in order to get
decent results even when we're giving them non-performance-dependent pay. Just the perception of
monitoring people, at least over the short term, can get them to produce good quality work because they basically don't know how lenient this requester is going to be or how much they'll get evaluated on their work. Additionally, if you run experiments on Mechanical Turk or do similar types of tasks, you'll see people talking a lot about it on other forums, so this is just something to
be aware of. They may talk. In the initial phases of our experiment, we noticed that people would
actually talk about how they were getting paid, and we basically had to do several things to control that
for the second phase of our experiment. You should be aware that people will discuss your task on
outside sources such as Reddit and mturk forum. If you have ever run a study on Mechanical Turk and
you Google the name of your study, you'll probably see lots of threads talking about it,
especially if it was a popular task. What have we learned here? Different types of incentive schemes at
the same level of payment create biases, so we can actually trade off different types of accuracy such as
precision and recall with the speed of our work and use that as a control for different types of
algorithms. Given that you're trying to obtain a certain amount of work in crowdsourcing, it's important
to consider the right way to pay people, not just how much to pay people. What I see for the future is
actually being able to use this as part of crowdsourcing algorithms to kind of tweak the incentives for
different users in order to get the type of results that are beneficial for the algorithm. We also show
that, at least on this type of task and maybe some others, paid workers can be as good as volunteer
workers. And this is an important comparison to make because the domains of paid and unpaid
crowdsourcing have until now been pretty separate and not really compared. It's also important to be
aware of the metabehavior of workers on Mechanical Turk. So using the right design for your task and
the right controls can lead to a big difference in the accuracy of the results. In some ways, this work
actually raises some more questions than it answers. Because there's just so many different ways to
combine the financial and kind of intrinsic incentives for workers, we have a lot of room to go in creating
models that will actually combine these different types of incentives to predict how workers will behave.
So another thing we can do is actually quantify, for example, at both different levels of payment and
different methods of payment, what is the speed versus accuracy tradeoff that we can get. We've seen
that as we pay workers more, they may spend more time on the task, do more work, or
maybe even work more quickly because they feel like they have more control over what they're getting
paid. Finally, I think it's important to actually take all these incentives that people have been studying,
both from the social perspective and the financial perspective, and in the unpaid and paid settings, and build
models that will actually explain how all these will affect workers' behavior. Does anyone have any
questions?
>>: You kind of mentioned it a little bit at the end there, but have you looked into, instead of just paying
per task, paying for correct answer and see how that affects the amount of time that they spend on
each task?
>> Andrew Mao: Yeah. I think there's kind of like a confounding factor there. What we are interested
in studying here is how do financial incentives bias people when they're not really getting paid by
correctness. So you can pay per correct answer, but there's different definitions of a correct answer in
here. There's like a correct annotation or a correct light curve, so I think when you mix those things in
together, I'm not sure like what the right question is.
>>: With the light curves, you will never get the ground truth while you're there.
>> Andrew Mao: Correct.
>>: Actually for this task, one interesting thing is people, given an annotation, then they [indiscernible]
sound right, they are still not confident that that's the exact answer, so they move the telescope to that
direction and then start observing the planet for longer times and collect more [indiscernible] data, and
then astronomers spend like a month on it, and they sometimes [indiscernible].
>>: Right, right. It might not be practical in actual practice, but it would be interesting to see how it
stretches out the amount of time they are willing to spend. It kind of goes to the point you made about
even just telling them it might be evaluated can change the behavior, although I wonder about the
observer.
>> Andrew Mao: Yeah. I'm just going to conclude with a short discussion of some future directions on
interactive crowdsourcing, which is tasks that actually use people at the same time and leverage their
interaction with each other. I have two examples of basically different types of ways to study this
problem. Feel free to chime in with feedback. This is all a work in progress at this point. Why might we
want to study interactive tasks? Well, for one thing people like interacting with others. Humans are
naturally social creatures, and for example, interacting with others might get people to be more
interested. We can pay them less. In many types of social settings, people will give you their time for free, so why can't we do that for crowdsourcing as well? That's actually a lot of the reason why unpaid
crowdsourcing has worked so well. What type of tasks can we actually do though, when we get workers
to work interactively? Not just interacting through some sort of system, but interacting more directly
with each other. Is this type of work better than working individually? There are several things that
might happen when we get people to work collaboratively or interactively. They might be able to work
faster as a group, or at least increase the throughput of the work they're doing. They might produce better accuracy by giving each other feedback and being able to check each other's work, and they might be more engaged, which I think is one of the most important factors. We can get people to do more work for less actual cost to the system. But is the extra money we may have to pay them, or any other overhead that comes from putting people together, actually worth it? Because we're using more workers, so if the task is parallelizable, perhaps it's not. The related work in interactive crowdsourcing comes mainly from the idea of games with a purpose. Luis von Ahn has built many types of unpaid crowdsourcing tasks in human computation, where different types of games produce answers to interesting problems through the way that people interact with each other. And Walter has tried different types of tasks, such as having a group of people drive a robot and figuring out how to control that as a crowd, or how to carry on a conversation with a dynamic, evolving group of users. We're also drawing some inspiration from groupware research in HCI, which is basically interfaces that facilitate and mediate different types of collaboration between people. There are actually many different types of interaction that can
happen when we put people together and allow them to collaborate. So we can create games where,
by virtue of interacting with each other, they're having more fun than if they were just doing a straight
crowdsourcing task. We can create shared context, such as ways to chat with each other or share
information, that will let them make better progress toward whatever problem they're trying to solve. We can create atmospheres of competition or collaboration between users: if they're playing a game, they can compete against each other, which actually increases their effort, or they might just enjoy collaborating with other people -- collaboration can actually make people more interested in the task itself. Then we can also create all sorts of free incentives, such as levels, badges, or leaderboards, that allow people to indirectly interact with each other and spend more effort. For example, people love gaining levels in different types of games, even though it probably means nothing in the real world. I'm not going to focus on the levels or badges too much, but I'll give you a few examples of tasks where these mechanisms might be useful to study. If we take a look at the Planet Hunters task that we were showing you before, one way that we might be able to have people do this collaboratively is as a game. If we take a games-with-a-purpose approach, such as an ESP game style mechanism, we basically want people to mark transits independently and match them against each other. These two players are drawing transits independently, and if Player Two draws a transit in the same area as Player One, then with even a really simple matching algorithm we can mark this as something that was found by both players and give them some number of points.
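A minimal sketch of the kind of interval-overlap matching such a game could use; the (start, end) annotation format, the 50 percent overlap threshold, and the greedy pairing are assumptions for illustration rather than the actual mechanism:

    # Sketch: match transit annotations drawn independently by two players.
    # An annotation is a (start, end) interval of time on the light curve.
    # The 0.5 overlap threshold is an arbitrary assumption for illustration.

    def overlap_fraction(a, b):
        """Fraction of the shorter interval covered by the intersection."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        shorter = min(a[1] - a[0], b[1] - b[0])
        return inter / shorter if shorter > 0 else 0.0

    def match_transits(player1, player2, threshold=0.5):
        """Greedily pair up annotations that agree; each pair earns both players points."""
        matches, used = [], set()
        for t1 in player1:
            for j, t2 in enumerate(player2):
                if j not in used and overlap_fraction(t1, t2) >= threshold:
                    matches.append((t1, t2))
                    used.add(j)
                    break
        return matches

    # Two of these annotations agree, one does not.
    print(match_transits([(10, 14), (40, 43)], [(11, 15), (70, 72)]))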
Another way that we might study this is collaboratively, with something more like a concurrent-editing or Reddit-style interface where people can look for things at the same time and vote each other's work up or down. There are basically many ways to approach this task. We could also give people the ability to chat with each other and interact in other ways, so it's worthwhile to study why certain approaches work better than others and what types of tasks people would rather play as games versus collaborate on. I can also give a more realistic example of where we might want to study interactive crowdsourcing, and this is stemming out of my
current internship this summer at Microsoft Research in New York. If we think about interaction
between users at a really large scale, how do we get an extremely large number of people to work on a
nonparallelizable task at the same time? What types of tasks can we solve when people are doing this? What mechanisms can we use to support this type of work? Is there a better way to organize workers
than just putting them all in the same place? There's actually a real motivation for this problem, and
that comes from the idea of crisis mapping. For example, when the earthquake hit Haiti two years ago,
there are nonprofit organizations that basically enlist a very large number of volunteers to filter through
all the news and tweets and other sources of information coming in to create a map of problem areas on
the ground where either first responders or NGOs can respond. This includes things like bridges that are destroyed, fires, people who need food, collapsed buildings, and so on. But the problem is that this is actually a humanitarian project with real needs, and it's not parallelizable, so you can't just take a bunch of workers and ask them to do this separately, because they're trying to build a combined output, this map of different events, at the same time. The problem is also that this task is done in a really ad hoc way right now. For example, the humanitarian organization that does this actually uses four different tools in a deployment to solve this problem: they have a collaborative map; they use a Ning platform as a social network; they open up a bajillion Skype chats; and they use a collaborative document system by a company which should be unnamed. There's a real motivation to solve this problem effectively, because this is an important problem that has humanitarian implications, but it's not really optimal how this is crowdsourced
right now. Because there are actually so many of these crises happening -- political unrest in countries such as Turkey and Egypt, natural disasters such as hurricanes and earthquakes, and even, for example, New York being hit by Hurricane Sandy -- the fact that this type of organization is ad hoc, and that the centralized organizations for crisis mapping are actually overwhelmed in terms of volunteers, means that crowdsourcing could be a really effective way to solve this problem. Being able to take untrained volunteers from the Internet and have them work efficiently on this type of problem would allow it to scale much better. This is also an interesting problem for testing approaches to human computation, because we can simulate different types of crises, including their news sources, and see how well workers do under different scenarios and mechanisms compared to the real thing. What I've been working on this summer is basically a combined interface with a two-pronged approach. First, we can create a platform that allows crisis mapping to happen in real time and eliminates all the different platforms that people are using, but more importantly, it allows us to study the research question of how people should organize in a crisis mapping -- in a complicated task that involves many people at once. Let me see if I can just do a little demo here. What we have here is an interface where people are able to chat in real time with each other. They can create different chat rooms and potentially specialize in different functions of the system. They can edit a document together collaboratively in real time, and they can develop this catalog of events that are happening in response to news. For example, you can see that we can edit this, change something, and save it, and it'll appear to other users at the same time. There's also a map of all the different events that are happening. There's no Internet, sorry. I'll just go back to the presentation. So people have this chat area over here, where they can create and delete rooms and talk to different users at the same time. They can edit things together, and they can fill in this information as different news items are aggregated. There's a map where they can see all the information that's being generated, and you can show this to people, to first responders, in a real crisis situation.
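As a rough sketch of the kind of record the volunteers are building up together, the event structure might look something like this; the field names, categories, and workflow in the comments are illustrative assumptions, not the platform's actual schema:

    # Illustrative event record for the crisis map; names and fields are assumptions.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class CrisisEvent:
        description: str                 # e.g., "Collapsed building near the market"
        category: str                    # e.g., "infrastructure", "food", "medical"
        latitude: float
        longitude: float
        sources: list = field(default_factory=list)   # tweet ids, news links, etc.
        verified: bool = False           # flipped once another volunteer confirms it
        created_at: datetime = field(default_factory=datetime.utcnow)

    # One room might draft events from incoming reports, another geolocate them,
    # and another verify them -- the specialization described in the talk.
    event = CrisisEvent("Bridge out on the main road north", "infrastructure",
                        18.54, -72.34, sources=["tweet:12345"])
    print(event)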
Why is this problem important? There are actually many questions that I hope to answer from this. For example, how should people be organized in this
type of problem? Should they -- who should talk to who? How many groups should we create? For
example, if you think of the number of chats that people should be doing, how do you scale that? Does
the hierarchy matter? Do you need leaders in this type of problem? What happens when you have
small numbers of workers versus a very large number of workers? What type of communications are
effective? Should we have private communication or just chat rooms or other ways to leave contacts for
each other? One thing that we see in that actual crisis mapping task is that people specialize. For
example, one group of people will just focus on doing the geographic information, so they'll kind of get
really familiar with the terrain and being able to locate things very quickly on a map. They basically
specialize in doing that type of geographic information. Does this type of specialization carry over to
other things such as filtering the data and categorizing the information that's coming in? As we kind of
understand this problem better, we can try different things to study which effects are positive for getting more work done, and basically optimize this process of interactive collaboration. How should we get people to organize? Should we appoint leaders? How should they talk to each other? This is a hard problem to study, but it's also very interesting and important, because it allows us to use the Internet to do tasks that were never possible before.
>>: Wikipedia is already -- there are lots of people editing the same document, and they have their own organization of people to approve things and so on. Also, fish tank is another kind of crowdsourced annotation system that's been going on for a while, and they also have their own way to maintain this kind of crowd-based system, so why is this so different? Why can you --
>> Andrew Mao: This is like kind of a -- one thing that happens in Wikipedia is that it happens pretty
asynchronously, right. You don't have to be there all at the same time, and you can kind of respond
whenever you feel it, but this type of task is actually time dependent and people work on it in real time.
So they communicate with each other in real time, and you're kind of under some pressure to produce
some results in like a day or two, in real time as things happen. So it's much more important I think to
optimize the process in which people are working together here. Does that make sense? Just to recap
as to where I think all this research is going, this is my view of basically a smart crowdsourcing system
that we'll build, hopefully within the next decade. We have different types of tasks that we want people
to work on, and we have a large collection of workers that are available. What we want to do is assign workers to tasks in an intelligent way: some workers may work in groups, some workers may work independently, and we have some idea of what workers are good at and how long they'll keep producing work. As people are working on these tasks, whether in groups or individually, we'll instrument them and look at how interested they are and how well they're doing, and we'll make decisions, based on the productivity we observe, about how we should pay them and whether we should change things such as their payment or their interface to make them work more efficiently. This is basically a loop that feeds into itself and optimizes a crowdsourcing task, or a group of crowdsourcing tasks for an entire system, in real time.
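A minimal sketch of that observe-then-adapt loop; the worker model, metrics, and thresholds are invented purely for illustration:

    import random

    # The worker model and thresholds below are toy assumptions; a real system
    # would plug in actual assignment, quality estimation, and payment changes.

    def observe(worker):
        """Stand-in for instrumenting a worker: returns noisy (quality, engagement)."""
        return worker["quality"] + random.uniform(-0.1, 0.1), worker["engagement"]

    def run_loop(workers, rounds=5):
        for _ in range(rounds):
            for w in workers:
                quality, engagement = observe(w)
                if engagement < 0.3:
                    w["payment"] = "per_task"          # try a different payment scheme
                elif quality < 0.7:
                    w["interface"] = "with_feedback"   # or change the interface
        return workers

    workers = [{"quality": 0.6, "engagement": 0.2, "payment": "wage", "interface": "plain"},
               {"quality": 0.9, "engagement": 0.8, "payment": "wage", "interface": "plain"}]
    print(run_loop(workers))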
These are some important questions for the future of crowdsourcing, right. We basically want to
understand not just the financial incentives but how people work socially and what motivates people to
provide work in crowdsourcing and human computation settings. It's important to look at both the quality and the engagement of workers, because together those measure not only how good the work they produce is but also how much work they produce. In the future, I think this work will motivate algorithms that can make more optimal decisions about how to allocate resources in human computation and estimate how people are influenced by different types of incentives. I think there's a particularly interesting area of study in how we can design interactive and social tasks to leverage the intrinsic interest people have in working together. Thanks. [applause]
>>: One comment on the difference between this type of interactive task and Wikipedia is that Wikipedia is never reactive, right. You don't have to respond to something that just happened on Wikipedia in terms of, like you mentioned, time-sensitive [indiscernible], but that kind of interaction has to do with a crisis. Something just exploded or something just blew up.
>>: I think another thing they are trying to do is kind of giving special tools to this group in a time where
they really need to be very efficient. Like they are trying to produce these results in a day or two when
this earthquake or this disaster happens, and maybe if they need the chat rooms, they should have
some special chat rooms. Meaning the way this will go is like this design is not the final thing, but they
will keep building on it with the tools [indiscernible].
>>: In a more structured way than just having this convention of opening tons of [indiscernible], I guess.
>>: Yeah. Based on these observations, maybe they will try to come up with like a [indiscernible]
system with their own leaders directing tasks and so forth to get more efficient.
>> Andrew Mao: We don't want to just have like an open-ended system. I think the idea is to kind of
see how people are interacting in this task and then try different things and actually run an experiment where we can say people did better when they had different chat rooms and specialized versus when they were all in one chat room and everyone was just talking over each other. I think this is an interesting problem that motivates studying that and gives us the platform to do it.
>>: About the payment. So how important is it for the users to understand exactly how they are getting
paid? This might be general, but the thing I was thinking is like you can have a mixture of these three
types of payments that can be either user dependent based on the performance or just random so they
can't strategize.
>> Andrew Mao: That's a good point. In this experiment, we made it very clear to users how they were
getting paid. We basically showed a banner at the top of the task that told them how they were getting
paid and how much they were getting paid. But you might imagine that different users respond in different ways to different types of incentives, so some users might work really fast when you're paying them by task, but some users might be really lazy when you pay them wages. I think understanding this interaction between users and incentives is an important area of future work. It's not just how
we should pay each different person individually, but maybe if we're paying people in different ways,
can we combine their results to produce better quality output? Basically if you have an algorithm, for
example, that can basically say these people produced things with higher precision and these people
produced things at higher recall, can we combine them together?
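One illustration of the kind of combination rule that might work here -- keep everything a high-precision worker found, plus anything the high-recall workers corroborate -- where the set representation and the support threshold are assumptions for illustration, not a result from the talk:

    # Combine annotations from workers with different error profiles.

    def combine(high_precision_sets, high_recall_sets, min_support=2):
        precise = set().union(*high_precision_sets) if high_precision_sets else set()
        counts = {}
        for s in high_recall_sets:
            for item in s:
                counts[item] = counts.get(item, 0) + 1
        corroborated = {item for item, c in counts.items() if c >= min_support}
        return precise | corroborated

    # Items are ids of candidate transits flagged by each worker.
    print(combine([{1, 2}], [{2, 3, 4}, {3, 5}, {3}]))   # -> {1, 2, 3}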
>>: Is it possible that economic theory can tell us something about this kind of problem in this context?
>> Andrew Mao: I think in the psychology literature, they definitely have some models of how people respond to different types of incentives, and I think it's kind of similar to what we've done in crowdsourcing. But we also wanted to look at how people did compared to when they're just self-motivated and not paid for anything. That was an interesting comparison to me.
>>: In the incentive literature, there's actually a lot of work about, based on what kind of payment,
what is the equilibrium of the situation? What is the behavior expected to maximize the profit of the
worker? The interesting thing is these results show that they are not exactly following these best
responses. For example, in the pay-by-time condition, the equilibrium is that once you learn what type of control mechanism is in place, you just want to make sure that you are clicking on something or touching the screen every two minutes, so you keep dragging along, open another tab somewhere else, keep doing whatever you want to do, and let the task sit there. You can play with the quality control mechanisms, maximize your profit, and minimize your cost. We see the same kind of thing for the pay-per-task mechanism, where once they see that they can't pass it after five seconds, they tend to go ahead and start doing that --
>>: Is the reason they're not doing that that they don't have the rationality to understand it, or do
they know it?
>>: Equilibrium doesn't take into account intrinsic motivation to do the task to just do well.
>> Andrew Mao: It also doesn't take into account people -- like for example, if you get your work
rejected, it's really bad, so there's a really big negative payoff if you behave completely economically
rationally. So what's happening here is people are kind of balancing their desire to behave in that
direction with kind of some negative costs that are not all completely clear.
>>: Maybe I think, to what you were originally saying, the problem with a lot of these equilibria in
practice is that you can't communicate the mechanism itself.
>> Andrew Mao: To workers?
>>: So you can imagine a much more complicated scheme than just pay per minute. They might not
completely understand it, but they may communicate, so you can still have a hidden control where
you're waiting for them to take an action every so often, and if they don't want to ever find out -- or
they might want to find out what that period is, the 30 seconds or two minutes or whatever it was, but
to explore that, they would have to get kicked out of the task, right. It's kind of like saying figure out
what can kill me; it's not worth exploring --
>>: But in addition to that, in line with the incentive literature, you'll see that people do this
[indiscernible] analysis based on assuming that people lack motivation and they will maximize their
profit. Even in the simple game settings like the [indiscernible] game where people can fully understand
all aspects of the mechanism and there is nothing to be revealed, people still don't follow it. People have different kinds of motivations, like altruism, helping somebody else, not screwing up whatever the system is; they still don't really follow the equilibrium. That's why these experiments are critical to understanding what people actually do, and the theory doesn't tell us that.
>>: What was the background that was shown to workers here in terms of you're helping science or
you're saving the world?
>> Andrew Mao: We basically showed a really similar tutorial to the Planet Hunters task itself, but we
didn't tell people they were helping science. We can't really tell them that -- they are helping science, but we can't say you're helping us find stars, because it's kind of like a metatask, right.
We're trying to study the process of crowdsourcing which will eventually help people find more planets,
but maybe not directly here. So there's just basically a consent form, and I guess I didn't show it, but a
pretty long tutorial about how to use the interface to find transits.
>>: The only thing Andrew wrote there was how to apply Planet Hunters if they know what Planet
Hunters [indiscernible]. He didn't even say it's Planet Hunters.
>>: Obviously that plays to the motivations.
>>: I think it could be interesting to study a task that doesn't look scientific.
>> Andrew Mao: You might actually see a lot worse results. If you have people who were unpaid to do this
task but were just interested in it versus people on Mechanical Turk that were probably completely -- I
don't know. It might be hard to make that comparison, because anything that is interesting enough for
people to work on without being paid is probably also interesting to people on Mechanical Turk.
>>: Yeah, that's certainly a risk [laughter]. There's some work on trying to figure out motivations, and I wonder whether a completely pointless task biases things in the opposite direction. So you could find something that looks less like science but still looks productive, sort of like the OCR tasks, and then you have push-the-green-button type things where they're completely not doing anything helpful. They're just taking part in a test.
>>: I am wondering if people would like it if they were told that, by doing this, you are contributing to this larger goal of running this mechanism, and whether people were more engaged -- being a member of the team.
>>: Yes. Yeah, yeah, yeah. Yeah.
>>: Thank you, Andrew. [applause]