>>: Each year, Microsoft Research hosts hundreds of influential speakers from around the world,
including leading scientists, renowned experts in technology, book authors and leading
academics, and makes videos of these lectures freely available.
>>: All right. Hello, and welcome. We're pleased to welcome Peter Organisciak -- do you want
to say it?
>> Peter Organisciak: Organisciak.
>>: Organisciak to -- the way that you can remember it, though, is it sounds very much like
organized or organizing, and it's very appropriate for a person who does work in the information
sciences space. So he's here from the University of Illinois at Urbana-Champaign, where he
works with Miles Efron, and he's going to be here all summer in Cambridge, Mass., working
with Adam Kalai and Susan Dumais and me this summer, and so he's out here to brainstorm, and
then will be with us from afar over the summer. So, thank you, Peter.
>> Peter Organisciak: Thank you. So, first of all, I'm very grateful to be here. I'm enamored by
the setup that you guys have here. This is my second day, and I'm having a lot of fun. So, today,
I'm going to talk to you guys about crowdsourcing, and I expect that some of you have some
expertise in crowdsourcing. But I'll talk about it from a couple angles that I think you'll find
interesting. Essentially, I'll look at it in two ways. In the first part, I'll talk about the motivations
of users that contribute to crowdsourcing, a lot of things that you may have inferred from your
past experience. But I'll hopefully help you think about it in a more schematic way, and that
follows from my master's thesis.
And then, secondly, I'll talk about implementation from the system end, namely, discussing ways
to think about the reliability of the contributions that you get from crowdsourced information.
So, yes, just a brief overview. I'm sure many of you are already familiar, but crowdsourcing is
defined as the act of taking a job traditionally performed by a designated agent and outsourcing it
to an undefined, generally large group of people in the form of an open call. The term itself was
coined in a Wired article in 2006, and if anyone has ever read that article, the definition there
actually had a much narrower scope. And, as the author of that article, Jeff Howe, pointed out
only a few days later, when people heard the word "crowdsourcing," they coopted it and ran with
it as something much broader than what he had initially defined it as.
So it became this umbrella term, and as an umbrella term, it encompasses a number of other
things that we're already familiar with. So crowdsourcing itself isn't new, just that word to talk
about the phenomenon is new. So some of us may be familiar with Commons-Based Peer
Production by Yochai Benkler, which he introduced in 2002. Then, in 2006, he wrote a good
book called The Wealth of Networks. There was a popular book in 2004 by a journalist, James
Surowiecki, The Wisdom of Crowds, where he made the argument for collective intelligence in
aggregate, sort of a counterpoint to our traditional understanding of crowds as mobs, as
irrational beings. And then, of course, in 2006 Luis von Ahn had the ESP Game, which he
formulated around the idea of human computation, so using humans in the way that you would
traditionally use computing cycles, but for certain things that humans can do that computers simply
can't.
So another way of thinking about the term crowdsourcing is as a verb that tries to capture the idea
of utilizing the wisdom of the crowd -- going out to people with a task and hopefully getting
something intelligent back. And, since it's a verb, it doesn't inherently suggest whether you are
successful in crowdsourcing. To crowdsource is just to push your project out, so, like, "Look
what I crowdsourced with my friends."
When I started a couple of years ago in this area, it wasn't always clear why anybody would ever
contribute to some of the crazy projects that were coming out. People were excited, but they
weren't exactly sure how you actually create something that gives as much as it takes, and at the
same time, it was becoming apparent that the most successful projects seemed to find success
almost by happenstance. So today, or at least in the first half of this talk, I'll try to talk about that
issue of identifying whether it's a task that fits a crowd and how you get that crowd to come over
to your side. Oh, yeah, and I have a bunch of photos from Flickr Commons. These are all
public-domain photos.
So here's another good quote from Jeff Howe: "We know crowdsourcing exists because we've
observed it in the wild. However, it's proven difficult to breed in captivity." And yet that's
changing. In the past few years, we've slowly been filling in the gaps in our knowledge of online
crowds, but the question of what successful crowdsourcing sites actually do is interesting,
because it seems like many sites have almost stumbled upon the formula. So
what I wanted to do is actually look at the bottom-up learning of existing sites that have
approached crowdsourcing and been successful with it, and I wanted to figure out exactly what
they did, but there's a problem. So, like I said, 2006, crowdsourcing had just been defined as a
term, at least, and the definition was being coopted by the public, so how do I sample
crowdsourcing sites if we're sort of negotiating the meaning of this term? If I set a definition for
the sample and then try to sample around that, my definition would likely be inadequate, and
then I'd also run into the problem of methodically finding websites that fit that definition, right?
So if only there was an online crowd that semantically described websites, which, of course,
there was. Some of you may be familiar with Delicious, which was a website where people
saved their bookmarks, and as they're saving their bookmarks, they type in tags that describe
what that website is about. So I thought, if I'm looking for the sort of publicly defined area of
crowdsourcing, why not look at the most common websites tagged with the word
"crowdsourcing"?
So I looked at that. I collected 300 sites. I looked at every single one of them, which if there
was ever an argument to split up human labor into many tasks, it's when you're looking at 300
websites over the course of a number of weeks. I visited each one, wrote down keywords that
described their methodology and structure, then, in subsequent passes, standardized them and
synthesized them into higher-level concepts. So what I ended up with is a proposed list of 11
nonexclusive categories, and for my purposes that was useful because I wanted to get a breadth of
what types of crowdsourcing sites are out there. So this isn't the definitive taxonomy, but it is one
that covered every single one of those sites, and then I was able to sample off of it. There were
sites that took an encoding methodology -- so perception-based tasks -- as well as creation, idea
exchange, knowledge aggregation, opinion aggregation and skills aggregation. Then there were
also sites that took a commercial approach in their structure, and some sites that offered a
platform to make it easy for anybody to come onto that site and connect to a lot of people. There
were the more ludic sites -- gaming, playful sites -- group empowerment sites and just-for-fun sorts
of sites, and I'll focus on a couple of these.
This was the distribution, at least on Delicious, of the categories, but that chart is dependent on
the biases of whoever tags on Delicious, so the chart itself isn't too important. But a couple that
are interesting to us here -- encoding websites are sites that approach the crowd from the human
computation angle, thinking that there are certain things that require human abstraction and
human reasoning to complete. If only there were some way that, by bringing lots of people
together, we could make it easier to complete these tasks, because by definition they're things
that computers can't do.
So here's an example that I like. I also like it because I found this Scandinavian classroom
image, which sort of matches the Finnish word game. But this is a game for digitizing Finnish
medieval transcripts. And what happens is, these little moles hold up words with the OCR
transcription, and you say whether it's correct or incorrect. And then, as you can see with some of the
letters, it's a bit of a difficult task. To a computer, I don't know how easy it would be to teach it
that As look like that, right? So sometimes it works, sometimes it doesn't. Then, they have
another game where the ones that people vote as being incorrect, then you type them. The
reason it's a game is because you're rushing to do it as quickly as possible and you graduate to
higher levels, where you do more difficult tasks, so more OCR problematic words -- I'm sorry,
more uncertain words.
Here's another example of a perception-based task. Once again, transcription. If you look at
things like handwriting, that's another difficult problem, where OCR, you may find a lot of
errors. It's certainly a problem we can tackle, but just with a lot of trouble. So this is from the
Bentham archives. Jeremy Bentham was a philosopher, and what they found was, lots of people
in Britain liked to contribute to this, because they liked learning about some of their history,
especially retired folks and actually scholars on sabbatical. But one of the things about Jeremy
Bentham's handwriting is, as he got older, it got messier. This is still pretty clean. Later in his life,
it was really difficult to make sense of, and actually, I tried contributing to this one time.
It was terrifying.
Another one that's useful for us to know about, knowledge aggregation, so projects that bring
together what we know or what we've experienced. I saw a number of sites like that. Obviously,
we all know sites like Stack Exchange, and I found other sites that work relative to your location -- a
mobile site, for example, where if you find a pothole or something wrong with your city,
you can quickly notify the proper authorities, and then it maps it. So, in aggregate, you
could see which parts of the city were being neglected. It made the city more accountable. They
couldn't just say, "We're going to look at the wealthy neighborhoods."
Skills aggregation, pretty self-explanatory. I just like this image, to be honest. Auto polo
apparently was a thing. Creation -- so that's where people create things from scratch. There was
one really cool project that I looked at called Star Wars Uncut, which split up the first Star Wars
movie into 15-second bits, and people would adopt these 15-second bits and refilm them in their
own way. They would animate them or use Claymation, and they remade this full movie from
that. Actually, I think they're working on an Empire Strikes Back one now. I saw a trailer a few
weeks ago.
So, yes, what did I do with this? Out of those 300 sites, I did a purposive sampling of a number
of sites that well represented each category, and then, looking at those sites -- I won't get into
details for time's sake, but I performed a content analysis on the points of interaction in each of
those sites. And, from there, I went and I interviewed people to try to find out what points of
interaction actually mattered to users.
At the other end, I ended up with a number of primary and secondary motivators that seemed to
suggest things that you either need to have, or that are good to have, in a crowdsourcing project to
compel people to contribute. I'll read these off, but I'll look at them more closely in a moment. So there
was interest in the topic, ease of entry and of participation, altruism and a meaningful
contribution, sincerity, appeal to knowledge and, of course, money. So let's get money out of the
way. Money is the most reliable motivator, as you can expect. If you have nothing else, money
can find you contributors. However, it also has a tendency to overwhelm other motivators. It
can also be a bottleneck, so you're limited to how much you can pay people.
It's not completely problematic to pay people for tasks. Your own Duncan Watts did one of my
favorite studies a few years ago with Winter Mason, I believe, showing that there are still
intrinsic motivators when you pay people. But it's definitely more difficult to get people to get
excited about a task, even if it's easier to find people. Most of us know Mechanical Turk.
Everybody knows? We know Mechanical Turk. Here's another example, Quirky, which is a
collaborative product-creation website, where people submit ideas for a product that should exist,
and then the community develops that product together. They vote on it, they come up with
ideas on the specifics of how it will work, what the design should be like. And every activity
gives you a small percentage of ownership in that product, so then you actually get profits if that
product goes to market.
This was one of the sites that I looked at, and the people I interviewed seemed to think that it
wasn't just about the money. You see a lot of people just excited to create something or to
actually have their name on something that goes to market. I'm sure many of us can relate to
that.
One of my professors, Michael Twidale, likens it to Tom Sawyer painting a fence, where he has
to paint a fence while the other kids are playing, and one of his friends comes by and he tries to
make fun of him. And Tom Sawyer's like, "No, I love this. This is fun." Another kid comes by
and he's like, "No, I think this is great." And eventually he convinces another boy to take it over,
because that boy was led to believe that it's a fun thing, and eventually everybody wants to paint
the fence. And Mark Twain very nicely summarizes it, saying, "If Tom Sawyer had been a great
and wise philosopher, like the writer of this book, he would have comprehended that work
consists of whatever a body is obliged to do and that play consists of whatever a body is not
obliged to do."
Here was another really important motivator, interest in the topic. And you guys can probably
guess that, but it kept coming up over and over again, so much so that I feel like the best areas
for crowdsourcing are just areas of popular amateur communities that haven't gone online yet.
You could probably come up with a great quilting crowdsourcing site. Star Wars Uncut, people
participated in that because they like Star Wars, and there's a couple other examples that I'll talk
about later, where this played a big factor.
Ease of entry, ease of participation. This was cited as being important by every person I
interviewed -- sorry, for every example and by most of the people I interviewed. I chose this
photo, just because it's literally of Easy Street. Altruism and meaningful contribution. I didn't
have a photo of a High Street -- or a High Road, sorry, but people like helping out with things. If
they think you're being genuine in what you're performing, people can jump onto that. So here's
a good quote from a Library of Congress report about Flickr Commons, where they mention that
it was a very successful project for Library of Congress to partner up with Flickr. When they
first announced it, they said, "Help improve cultural heritage," and that seemed to have struck a
chord with people. Galaxy Zoo is another example, where people classify galaxies,
many of which have never been seen by human eyes before. Appeal to knowledge, opinion, sincerity,
and then there are a couple other secondary motivators. I won't go through all of these.
I found that while you need at least one of the primary motivators for a site to succeed, these
secondary ones are things that seemed to encourage people to contribute more. So if they're already on a
website, this will push them a bit further. This was sort of surprising for me, indicators of
progress and reputation. So those are the gamification mechanics. I was surprised that it wasn't
more important for people, but lots of people named it as like a secondary, tangential thing. I
interviewed a guy who was a heavy FourSquare user, and I thought, "He's going to tell me about
gamification," and even he said, "I like FourSquare, but I like it because I know what restaurants
I went to with my girlfriend, and I know where my friends are. And then the points are good for
keeping me more diverse when I go out."
Cred, another one where I just liked the photo. I think it matches there. Just some feedback.
Okay, so let's move on. That's fine and good, but how does it relate to us? So on this part, I
want to talk specifically about how crowdsourcing relates to my area in information retrieval, but
also the bigger problem of understanding the reliability of people when you're dealing with
crowdsourced data. So here's the problem. Classification is a tiring task, and it's difficult to use
on large scales. So Galaxy Zoo, for example, they classified something like 60 million galaxies.
Before Galaxy Zoo, there was a doctoral student, and he did it day and night for months on end,
and he did something like 20,000, and that was far beyond what anybody else had ever done in
terms of galaxy classification.
We just don't have the time or the sanity to perform classification on large scales, and in
information retrieval, this is a problem because we need -- oftentimes, we rely on nicely
classified data for evaluation. And, as a result, we're often evaluating on the same TREC data
from the Text Retrieval Conference, and it's hard to do diverse work -- sorry, it's hard to do work
off the beaten path -- without the big overhead of creating something that you can evaluate
against. Another Microsoft researcher, Omar Alonso, I think, has argued for paid-worker
crowdsourcing as a way to overcome this tediousness and the problem of expanding to
large numbers of people.
So here is a question I was interested about. When you're not sure about a rater, how do you
determine whether they're reliable, and how can you do so in a way that is fair to them? So if we
crowdsource, if we have random online users contribute data that we're then building research
off of, how do we know that we should be building research off of that? And, more specifically,
how do we tell our reviewers that it's okay? Because, oftentimes, these people -- for example,
workers on Mechanical Turk -- are self-selected and semi-anonymous. So there can be people
who show up who might be malicious. How do you know?
In my specific case, I was working on a project that was improving retrieval for metadata records
of varying lengths, and I collected a lot of relevance judgments, and they seemed good, but I
wasn't sure. Partly, the kappa scores -- a measure of inter-rater agreement -- were low. So I needed
to figure out: are there problem people in there, and can I correct for them? I looked at them
from three different angles. First, I asked whether the amount of time that somebody spends on a
contribution -- in my case, making a classification -- reflects the quality of that contribution.
Secondly, I asked, do contributors grow more reliable over time? The more you do a task, do you
get better at it? Finally, I tried
to look at whether your agreement or disagreement with other raters reflects your overall quality
as a rater. And, if it does, which by all accounts we should expect it to, how do you account for
that? How do you account for people that seem to disagree with other contributors in the
system?
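As a rough illustration of the agreement measure mentioned above, here is a minimal sketch of Cohen's kappa for two raters, the pairwise form of the statistic. The talk doesn't specify which kappa variant was used, so treat this purely as illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n)
              for lab in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)

# Example: two raters judging relevance of the same five results.
print(cohens_kappa(["rel", "rel", "non", "rel", "non"],
                   ["rel", "non", "non", "rel", "non"]))  # ~0.62
```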
So, before I move on, this is what a task looks like. You're given a query, a description of what's
relevant, what's moderately relevant, and then a number of results, and you say this result is
relevant to the query or non-relevant -- that sort of relevant, non-relevant dichotomy is
something that's common, albeit possibly problematic in information retrieval. Another bit of
terminology, what constitutes correctness? I'll talk a lot about accuracy, which is just the
probability of a person's contribution being correct, so what does this mean?
We had one data set with oracle classifications -- classifications that known reliable raters had
submitted for the same items -- so we could compare contributions against those reliable people.
And then, when we didn't have the oracle data, we also looked at majority rating: what does the
majority of people say? This chart simply shows that there's a correlation between agreement
with an oracle and agreement with other users. So, when we didn't have oracle data, using
majority rating was a good proxy, albeit a more conservative one, so it erred on the side of
rejecting.
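To make the majority-rating proxy concrete, here is a minimal sketch (Python, not from the talk; the field names are hypothetical) of computing a per-item majority label and each rater's accuracy against it:

```python
from collections import Counter, defaultdict

# Hypothetical judgments: (rater_id, item_id, label)
judgments = [
    ("r1", "doc1", "rel"), ("r2", "doc1", "rel"), ("r3", "doc1", "non"),
    ("r1", "doc2", "non"), ("r2", "doc2", "non"), ("r3", "doc2", "non"),
]

# Majority label per item (ties broken arbitrarily).
by_item = defaultdict(list)
for rater, item, label in judgments:
    by_item[item].append(label)
majority = {item: Counter(labels).most_common(1)[0][0]
            for item, labels in by_item.items()}

# Per-rater accuracy against the majority label (the oracle-free proxy).
hits, totals = Counter(), Counter()
for rater, item, label in judgments:
    totals[rater] += 1
    hits[rater] += (label == majority[item])
accuracy = {rater: hits[rater] / totals[rater] for rater in totals}
print(majority)   # {'doc1': 'rel', 'doc2': 'non'}
print(accuracy)   # r3 scores 0.5 because it disagreed on doc1
```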
And then I had two data sets -- we had two data sets. One was the one I previously mentioned,
cultural heritage data, 23,000 relevance judgments, and it was very easy. It was essentially a
binary task. There was relevant, non-relevant and then a third non-answer, called "I don't know."
And then, for diversity, I also had a data set of Twitter sentiment ratings. I actually -- was
talking to a computer science professor a few weeks ago, and he mentioned, we all have our
embarrassing Twitter sentiment study in our past. That's like a rite of passage, and I just slunk
into my chair, because I do, too. But yeah, so that data set had five possible answers, so it was
easier to get it wrong.
And then this I have mostly for information scientists, just to show that ratings distribution was
inverse power law, which haunts us everywhere we go -- a sort of uninteresting distribution. But
yes, so, first question: does the amount of time a person spends on a question affect the quality of
their answer? If it did, then what we would see is that people who spent more time on answering --
sorry, more time dwelling on a choice -- would be correct more often. Can anybody guess whether
that ended up being true or false? Does spending more time on a task make you more likely to
be right?
>>: It's false?
>> Peter Organisciak: As Jamie said, it's false. So the more time you spend on a task doesn't
suggest how good you are at that task. Yes?
>>: Were you looking at mean time, or were you looking at the entire distribution? What were
you actually comparing it to?
>> Peter Organisciak: I was looking at the amount of time that people spent between ratings. So
people were given a query and a number of results, and I was looking at the amount of time that
it takes them from the last time they submitted something to the next time. So how long were
they looking at that and thinking, "What's the proper answer," essentially.
>>: The total amount of this?
>> Peter Organisciak: It was the mean. Sorry.
>>: For each person, how long they took on average compared to their (inaudible)?
>>: You just have two groups, correct and incorrect, and then you look at the difference in
mean?
>> Peter Organisciak: No, this was by rating, so these were all the ratings, I believe. This
distribution is how much time the correct ratings took, and this is how much time the incorrect
ones took, and there wasn't any difference.
>>: So you're just doing a t-test between these two distributions.
>> Peter Organisciak: Yes.
>>: The distribution of the corrects and incorrects. Those look reasonable.
>> Peter Organisciak: It was nonparametric, but they were pretty close. Nothing seemed to
suggest that people that did things quicker were sloppier.
>>: The p-value is reasonably low.
>> Peter Organisciak: It is, and I'll get to that, because once I started pulling out some
confounding variables, something interesting happened. Yes, so it was 0.15.
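As a sketch of the kind of comparison being described -- correct versus incorrect ratings compared on dwell time with a nonparametric test -- here is a minimal example. The talk doesn't name the specific test, so the Mann-Whitney U test and the toy dwell times are assumptions:

```python
from scipy.stats import mannwhitneyu

# Hypothetical dwell times in seconds, split by whether the rating was correct.
correct_times = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5]
incorrect_times = [4.0, 5.3, 4.4, 6.2, 5.0]

# Nonparametric test: no assumption that dwell times are normally distributed.
stat, p_value = mannwhitneyu(correct_times, incorrect_times,
                             alternative="two-sided")
print(stat, p_value)  # a large p-value would mean no detectable difference
```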
>>: Did you [parcel] out anything like individuals who just happened to be very fast, whether
they agreed or disagreed, or something like domain knowledge?
>> Peter Organisciak: Sorry?
>>: There are two things that could be correlated with speed. One is whether you're a quick-twitch or a slow-twitch person, so you should see that within an individual there wouldn't be
much difference, but maybe you have, when you did this, more slow-twitch people.
>>: And then you have cheaters, too, who just want to get through really fast.
>> Peter Organisciak: Exactly, yes.
>>: And those do a lot of hits.
>>: But the other one is domain knowledge. You would have people that have zero variance in
their responses.
>>: Could be the difficulty of the question, too.
>>: You just found the cheaters, and one thing that I and [Edjah Cameroon] found was, if you
see a worker who's got a very low variance in the amount of time, then they're probably just
pushing the button.
>> Peter Organisciak: Exactly, yes. Domain knowledge is an interesting question, because both
my data sets were very simple tasks; even the Twitter one was sort of perceptual -- does this tweet
look positive or negative? With a more difficult task, where more domain knowledge is required,
it would be interesting to see whether this changes. However, like I mentioned just a moment
ago, we would give people batches of tasks. So it wouldn't just be one query and one result that
you classify, because that slows things down, and they always have to learn a new query.
So I started looking at whether the order affects dwell time, so whether it matters that this is the
first task in a set -- or, sorry, if you are doing a first task in a set, does the amount of time you
spend on it change things. Or if this is the fourth task in a set, does the amount of time you
spend on it change things?
>>: Can I ask one more detail on the setup?
>> Peter Organisciak: Yes.
>>: Did you have the same query and then different results for an individual? Did they look at
several different results for the same query?
>> Peter Organisciak: Yes.
>>: So you can certainly imagine that paging in a new query would take you some time.
>> Peter Organisciak: So that was one of the reasons that we gave them a number of results,
because there were multiple queries, and they were only shown 10 possible results for that query.
Sometimes, they would return to a previous query. But, yes, so what I found was the amount of
time doesn't change. Regardless of whether it's your third task in a set, fourth, fifth, sixth, it
doesn't change, but the first and to a lesser extent the second task do take more time, which
makes sense, right, because you have to remember, the first task is conflated with just the
overhead of loading the task and figuring out what the question is, making sense of the query and
what's relevant, what's non-relevant. So you'd expect the first task to be longer. However, what
turned out to be the case was that not only did people spend more time on the first task, but the
people who spent more time on that first task specifically were more likely to be correct on it. So
that's why, earlier, there is that slight shift, even if it's not significant, because that effect exists
only for the first rating of a set.
So here's what I thought -- that this performance increase is related to people reading the
instructions more carefully, so just people that spent more time on that first task were more
correct because they paid closer attention to the codebook.
But, if that was the case, then that effect would also linger across all their other tasks. It's not just
"I spent more time on the first rating and got the first rating correct"; it's "I spent more time on
the first rating and was more likely to be correct on everything else." And that ended up being the
case. Here I split up the people who got the first task right and the people who got the first task
wrong. I don't know why I did this sort of simplistic thing, but essentially we had more A students
among the people who got that first task right, looking at the other nine tasks they did after that
first one. That seems to suggest, I think, an interesting user-interface question, where you could
quite quickly figure out how likely somebody is to be correct based on how good they are straight
out of the gate. It also seems to suggest, at least to me, that bad raters aren't necessarily always
malicious; sometimes they just didn't read the instructions well enough, and as a result, they were
worse throughout.
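A minimal sketch (Python, not from the talk; the column names and toy data are hypothetical) of the kind of split being described: group raters by whether their first judgment in a set was correct, then compare accuracy on the remaining judgments:

```python
import pandas as pd

# Hypothetical log of judgments: one row per rating, in the order submitted.
df = pd.DataFrame({
    "rater":   ["r1"] * 3 + ["r2"] * 3,
    "order":   [1, 2, 3, 1, 2, 3],   # position within the rater's set
    "correct": [1, 1, 1, 0, 1, 0],   # 1 if the rating matched the oracle/majority
})

# Did each rater get their first task right?
first_ok = df[df["order"] == 1].set_index("rater")["correct"]

# Accuracy on everything after the first task, grouped by first-task outcome.
rest = df[df["order"] > 1].copy()
rest["first_ok"] = rest["rater"].map(first_ok)
print(rest.groupby("first_ok")["correct"].mean())
# Raters who got the first task right tend to score higher on the rest.
```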
>>: Can you distinguish between bad raters and people who were just careless during the first
session with this probe?
>> Peter Organisciak: Yes, so I don't think you can distinguish. If anybody has any ideas for
how you would study the difference between people that are trying to cheat you versus people
that are careless.
>>: The "I don't know option," how did you treat that?
>> Peter Organisciak: I treated it as a non-answer, so if somebody said "I don't know," I just had
somebody else rate that same thing.
>>: Maybe you could look at some tasks that are really easy versus tasks that are hard and use
them to catch cheaters or people who aren't reading the instructions, in either case.
>> Peter Organisciak: Yes, so I thought that was interesting, and that's why I'm sharing it,
because I liked it. So the next question I asked was experience. Do you grow more reliable over
time? Specifically, I looked at lifetime experience -- the number of classifications you made for
me overall -- and also query experience, the number of classifications you made for an individual
query. So in that previous screenshot, the query was "plane," so how many times have you
classified things as relevant or not relevant to a plane? And I didn't see an effect there -- somehow
my dots got replaced, with a dot missing in each here -- but essentially, more or less, people on
their tenth rating were just as good as people on their 200th rating. If you looked at query
experience, it told a slightly different story. However, when I was reviewing this yesterday, I
realized I really should have put the standard deviations here, because I think this overemphasizes
the effect. As you got toward the far end, there were fewer people who did 40 or 50 ratings, but
still, it went up slightly.
Yes?
>>: Did you look at time for each user as you go and see each query? So what I'm thinking of is
in the 2011 TREC, there was this study where they took time, and what they saw is essentially
for each user, a user gets faster and faster as they get more experience, but at every query, there's
a peak, so it looks kind of like a saw tooth. There's a peak for the new query. It starts to go
down, then there's a peak, and it starts to go down. It's a very slow drift down in one direction,
but you get faster per query, and then you peak every time you get an essentially new task.
>> Peter Organisciak: That's a good question. I have the data to look at that, but I didn't actually
look at it. I like that. Yes.
>>: Yes, so this is a question from Rajesh Patel. How do you look at the threshold, the lower
input threshold of the dwell times on tasks, for people who were correct versus people who were
incorrect?
>> Peter Organisciak: Sorry, did I look at?
>>: So for people who got better results, did you look at whether there is a particular threshold of
time spent -- the least amount of time and the most amount of time that was spent?
>> Peter Organisciak: I did that as I was playing with the data, and I don't recall for sure, but I
didn't see too much of an effect. So people, even when they're really quick, seem to be pretty
good at it, but one of the confounding factors might be how long the result was. So that might be
worth exploring. I didn't explore it in detail, but remember that the data set I was looking at
specifically varied between really short, tweet-length entries and really long ones, multiple pages
out of a yearbook, with a sort of balance between those two extremes. Yes, so the last
question I looked at was whether your agreement or disagreement reflects your overall quality,
and I tried two things. First, I tried removing raters that disagreed with a lot of people to see how
that affects the quality of the data set as a whole.
And then I tried weighing raters, so people, contributors, weren't completely removed, but the
strength of their contribution was weighed upwards or downwards based on how consistently
reliable they were. So, first of all, for removing raters, what we did was assign each rater a
maximum-likelihood user reliability score, which is just the probability of that person getting the
correct label. If you were below a certain threshold, I took you out, and that threshold was 0.67,
which sounds like a suspiciously un-round number. It wasn't fully arbitrary: what I did was create
a number of bots that went through the data and randomly labeled items, and those bots on
average had a reliability score of 0.67, so if anybody was below that, they were worse than a
random bot, and I took them out. And what happened was, sure, my kappa went up, but there was
very little increase in overall accuracy, which, when you think about it, is sort of sensible: I was
just taking out people who disagreed with lots of other people. But it goes to show that kappa
scores are a good indicator for their original purpose of measuring contention, yet they're
somewhat misused as an indicator of the reliability of a data set. And that's not to mention the
fact that kappa wasn't intended to be applied to hundreds of raters to begin with.
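A minimal sketch (Python, not the talk's actual code) of that removal step under the stated idea: score each rater by agreement with the reference labels, calibrate a threshold from random bots, and drop anyone below it. The data layout and function names are hypothetical:

```python
import random

def reliability(judgments, reference):
    """Fraction of each rater's labels matching the reference (oracle or majority)."""
    by_rater = {}
    for rater, item, label in judgments:
        by_rater.setdefault(rater, []).append((item, label))
    return {rater: sum(label == reference[item] for item, label in pairs) / len(pairs)
            for rater, pairs in by_rater.items()}

def random_bot_threshold(reference, labels, n_bots=100, n_items=50, seed=0):
    """Average reliability of bots that label items uniformly at random."""
    rng = random.Random(seed)
    items = list(reference)
    total = 0.0
    for _ in range(n_bots):
        picked = [rng.choice(items) for _ in range(n_items)]
        total += sum(rng.choice(labels) == reference[i] for i in picked) / n_items
    return total / n_bots

# Hypothetical usage: drop raters who do no better than a random bot.
# judgments = [(rater, item, label), ...]; reference = {item: label}
# threshold = random_bot_threshold(reference, ["rel", "non", "idk"])
# kept = {r for r, s in reliability(judgments, reference).items() if s >= threshold}
```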
Okay, so removing people didn't improve the reliability of the -- sorry, the quality of the overall
data set. What about weighing users? So what I did with weighing users is I similarly calculated
a reliability score for each user. I set it as the confidence in all of that user's ratings -- so if
Adam, for example, had a reliability of 0.7, all his ratings were valued at 0.7 of a contribution.
And then I went and recalculated all the user scores after doing that. It's EM-like, in that I had
this cost function over the user scores and, over time, you can imagine what would happen: what
if I'm a good rater, but I just get unlucky, and I'm rating the same thing that two other really bad
people are rating, and I disagree with those really bad people, but that's just because they're bad --
they're cheaters or something. I don't want to be punished for that.
So by iterating, what you would expect -- and what you see, to a certain extent -- is that over time
those bad raters' contributions are weighted lower. If I'm a good rater, my contribution is
weighted higher, so I'm punished less, and it pulls apart the effect of who you happen to be rating
alongside. So I calculated this
in two main ways. I won't talk about that third one, but I compared majority rating, which is just
taking the answer that the most people say and then majority rating using the user weights, and
then I tried different ways of assigning those user scores. The one I showed on a previous slide
was the simplest one. This doesn't matter.
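A minimal sketch (Python; an assumption about the details rather than the talk's actual algorithm) of the EM-like loop being described: alternate between estimating item labels by weighted vote and re-estimating each rater's weight from their agreement with those labels:

```python
from collections import defaultdict

def weighted_vote_em(judgments, n_iters=10):
    """judgments: list of (rater, item, label). Returns (labels, rater_weights)."""
    weights = defaultdict(lambda: 1.0)  # start with every rater trusted equally
    labels = {}
    for _ in range(n_iters):
        # Step 1: pick each item's label by weight-summed votes.
        tallies = defaultdict(lambda: defaultdict(float))
        for rater, item, label in judgments:
            tallies[item][label] += weights[rater]
        labels = {item: max(votes, key=votes.get) for item, votes in tallies.items()}
        # Step 2: a rater's weight is their agreement with the current labels.
        agree, total = defaultdict(float), defaultdict(int)
        for rater, item, label in judgments:
            total[rater] += 1
            agree[rater] += (label == labels[item])
        weights = {rater: agree[rater] / total[rater] for rater in total}
    return labels, weights

# Hypothetical example: r3 disagrees with the consensus and is down-weighted.
judgments = [
    ("r1", "d1", "rel"), ("r2", "d1", "rel"), ("r3", "d1", "non"),
    ("r1", "d2", "non"), ("r2", "d2", "non"), ("r3", "d2", "rel"),
]
print(weighted_vote_em(judgments))
```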
I found that the lower bound was pretty high to begin with. Even when I was just taking the
majority rating, without doing anything fancy -- just saying, these three people, look at this and
classify it, and whatever two or three of you agree on, that's what I'll take as the answer -- that
already had a high accuracy for the overall data set. So the iterative algorithm gave me some
improvements, but nothing significant. Hold on. Yes, so why is that? One reason, I think, is
because my data was really clean to begin with. TREC, two years ago, had for the first time a
crowdsourcing track with essentially the same task: how do you identify bad raters and account
for them? But their majority rating, the baseline of taking the most votes, was around 70% in
accuracy, while our data was 86% accurate out of the gate.
There wasn't much to improve on, and at the same time, it meant I didn't need to be so fancy with
what I was doing. My secondary data set, the Twitter data, improved more, because it was a lot
more complex than just binary relevant/non-relevant ratings. Yes.
>>: The 86% to 70% number, are you comparing them over the same data set?
>> Peter Organisciak: No. A different data set, so this was just using the TREC crowdsourcing
data set. There, if you do something very simple like going by majority rating, you don't get the
same quality as our data set. So there was something about our data set and how we collected it
or maybe the queries. There was something about our data set that made people more likely to
be right. Just going by majority rating was good enough.
And that seems to emphasize the point that, oftentimes, people are trying to be correct, because,
mathematically, if most people are making a good-faith effort to answer your question, then the
people who aren't will just get filtered out, even with only three people rating the same thing.
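To put a rough number on that intuition (an illustrative calculation, not from the talk, assuming three independent raters on a binary task): if each rater is correct with probability around 0.86, a simple majority of three is right whenever at least two of them are.

```python
# Probability that a majority of three independent raters is correct,
# assuming each is individually correct with probability p.
p = 0.86
majority_correct = p**3 + 3 * p**2 * (1 - p)
print(round(majority_correct, 3))  # ~0.947, higher than any single rater
```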
And then, of course, we found that raters that get the first rating correct are overall better raters.
This may be due to avoidable sloppiness, and I'd love HCI people to think more about that
problem: how do you convince people to pay more attention to the codebook? I suspect there are
a number of ways. One is maybe some qualifying tasks that a person has to take before jumping
into everything.
>>: You're not giving them any feedback, here, while they're doing it, right?
>> Peter Organisciak: I'm not giving them feedback, but that's the other thing, like we spoke
about yesterday. I'd love to see how people react if you show them how, according to our
calculations, how well they're performing. Do they make better effort to backtrack if they're
doing something wrong? Will they go and look at the results again? I found that raters grow
comfortable with a task quite quickly. After two tasks, they're already working at the speed
they'll do all their other contributions at. Then I found that EM-like algorithms are better for
complex tasks, but if you have something simple like binary classification, the improvement is
small.
So, thank you. That's the gist of my talk. I have a couple other things if anybody would like to
talk about it, but thank you for coming. Yes.
>>: Do you have any thoughts on what would happen in a situation where you didn't have such
clean data? For instance, if you had a couple of judges who were just out to get you? Maybe
they had a majority of the judgments?
>> Peter Organisciak: Yes, so that's one of the things that protects crowdsourcing. I taught a
course on crowdsourcing at a newspaper publisher a few years ago, and people would ask me,
why would you let anybody contribute? What if they're malicious? They
could take down your whole site, or they could write things that you'll be liable for. But there's
strength in numbers, and oftentimes, that's what works out. Separate from paid crowdsourcing,
if you have something like online commenting or whatever, you can also crowdsource the
flagging feature, where a lot of people that are honest participants in the community will flag
something that's dishonest. But your question hits on a point where these sort of approaches do
break down. If there is a coordinated effort to be malicious -- 4chan spent many years trying to
push at that boundary -- then you do break things down.
>>: I've also run experiments where literally one or two people did like 25% of the work, and
those guys were cheaters.
>> Peter Organisciak: That's interesting.
>>: But it may have been the nature. It was a very subjective test. It was kind of like, "Do you
like this or not?" So maybe they had a feeling like, "Hey, I can do whatever I want." I don't
know.
>> Peter Organisciak: That's a good point. So, most of the time you can hide in numbers. If you
can't, that's when you run into problems. 4chan, actually, a few years ago, rigged the Time
Person of the Year People's Choice voting, because Time didn't have any way to stop people
from repeat voting and no obstacles, so they made 10 million votes, all for the creator of 4chan,
and of course he got to the top of the rankings. But then what Time did was implement
reCAPTCHA, which was another project by Luis von Ahn, and what happened was 4chan
couldn't automate a way to get past reCAPTCHA, so they tried to overwhelm it by just putting
the same words in over and over, but they couldn't. Their gaming slowed to a crawl because they
had to make each vote by hand, and it went from 10 million to 10,000 after the CAPTCHA was
implemented. Plus, they digitized some old books, although I think half the digitization was just
lewd words, which weren't correct. Any other questions?
I had this other sort of fun project that I did a few years ago. I was trying to explain to a class the
idea -- one of the things that you really need to be considerate of with crowdsourcing is the ethics
of it. So with volunteer crowdsourcing, if you have volunteers contributing, they can smell
insincerity much of the time. They know if you're just trying to use them. If you're paying
people, it's harder, because you can connect to markets in developing economies where you're
not paying people a good wage, but it's the best they can get. So for this class, I tried to show an
example of Mechanical Turk while making people stay aware of the ethical implications.
So I had Mechanical Turk rewrite Jonathan Swift's A Modest Proposal. Is everyone familiar
with A Modest Proposal? It was an essay by Jonathan Swift where he was making fun of the
sort of cold social engineering rhetoric of the time. So lots of people were saying, "Here are the
solutions to Ireland's woes. This is how we do it." And oftentimes, those solutions didn't
actually think about the people involved, so Jonathan Swift made fun of it with this essay where,
in a sort of straight-faced manner, he suggested that poor Irish families should sell their children
for food to the English. Since that sort of satire was unknown at the time, it was
shocking to people. Today, we know it as satire, so it's not shocking to us.
At the start of the class, I pushed a button and the system split the essay up into a number of
sentences. Then it gave the sentences to Mechanical Turk workers, and they were asked to
rewrite them in modern English, in more casual, colloquial language. But what's more modern
than Twitter? So I had them rewrite each one in 140 characters, which was partially to keep them
from just copying and pasting the same thing. Three people would rewrite every sentence, then
three other workers would vote on the best rewrite, and that would be chosen as the final version.
Once again, this language, which to us isn't as horrific as it was at the time, suddenly became a
lot more to the point, and, of course, there were people behind the scenes rewriting this. I wonder
what they thought. And, of course, there was a Twitter account until -- I don't know, I didn't like
it. So that's just a fun little project, and that's it. Thank you.
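A minimal sketch (Python, purely illustrative; the helper functions for posting HITs are hypothetical placeholders, not a real Mechanical Turk client) of the rewrite-then-vote pipeline just described:

```python
import re
from collections import Counter

def rewrite_then_vote(essay_text, post_rewrite_hit, post_vote_hit):
    """Split an essay into sentences, have 3 workers rewrite each in <=140
    characters, then have 3 other workers vote on the best rewrite.

    post_rewrite_hit(sentence) -> list of 3 rewrites  (hypothetical helper)
    post_vote_hit(rewrites)    -> winning rewrite     (hypothetical helper)
    """
    # Naive sentence split; good enough for a classroom demo.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", essay_text) if s.strip()]
    final = []
    for sentence in sentences:
        rewrites = [r[:140] for r in post_rewrite_hit(sentence)]  # enforce 140 chars
        final.append(post_vote_hit(rewrites))
    return " ".join(final)

# Toy stand-ins so the sketch runs without any crowd at all.
fake_rewrite = lambda s: [s.lower(), s.upper(), s]
fake_vote = lambda options: Counter(options).most_common(1)[0][0]
print(rewrite_then_vote("A modest proposal. For preventing poverty.",
                        fake_rewrite, fake_vote))
```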
Are there any other questions or thoughts? Discussions? Any slides people would want to see
again?
>>: I didn't know about Flickr Commons.
>> Peter Organisciak: Flickr Commons is great. Flickr Commons actually reminds me of
another project, in Australia, that the National Library of Australia did, where they've been
digitizing old newspapers. And the problem with Australia -- or with Australian newspapers -- is
they often got printing presses from England, like broken-down, old printing presses. So the
quality of the old newspapers isn't good, so they scanned them, put them up along with the OCR
text, but the OCR text is terrible. So they built in functionality where you can hit edit and correct
it. I think that's a very strong model: as a consumer, you're reading these newspapers and you
see, oh, there are all these typos in the OCR. Oh, there's an edit button.
Well, all of a sudden, I can switch from being a reader, from being a consumer, to being a
contributor, to fix something. That's part of your experience. You're sitting there in front of the
computer and you see something that can be fixed.
>>: And you're around tomorrow, right?
>> Peter Organisciak: Yes, if anybody would like to talk to me further or about any other
projects. Great. Thank you.