>> Ece Kamar: It's my pleasure to welcome Andrew Mao back to Microsoft Research Redmond. He interned with us last year, and now he's an intern at the Microsoft Research New York lab. Andrew is a fourth-year Ph.D. student at Harvard University working with Yiling Chen. He's interested in applying empirical methods for understanding human behavior in online settings. I'm giving you Andrew. >> Andrew Mao: Thank you very much, Ece. Today I will be talking about a couple of projects stemming from my internship last summer here, and basically about understanding better the connections between algorithms and human behavior in crowdsourcing. I'll just start with a really simplified dramatization of a crowdsourcing system. We have both paid and unpaid types of crowdsourcing. When people come to the crowdsourcing system, we basically are in need of their time and their effort. We either compensate them in financial ways such as payment or in other ways such as just common interest or altruism or social behavior. There are some fundamental questions in this type of environment because it really matters to us how much work the workers produce. It's not as simple as just paying them. If we're not paying them, how do we keep them interested? How do we measure how much work people are willing to produce, and how do we characterize how they interact with each other in the system and outside of it? These are some fundamental questions, I think, for crowdsourcing that need to be addressed. First of all, why do users give us their time? If we're paying them, how can we get them to spend more time with us? If we're not paying them, how do we keep them interested and measure how much work they're able to produce? If we're paying people, how exactly should we pay them? I'll go into this in the second part of my talk, but it's not clear that the current systems for paying workers in crowdsourcing are optimal. Finally, how do we design systems that can leverage the interaction and interactivity between users, building crowdsourcing systems that are more social and take advantage of the fact that humans can actually do more when operating collaboratively than independently? The approach that is kind of guiding all the work I'm talking about today is that in order to build better algorithms and systems for crowdsourcing, we shouldn't just be modeling human behavior. There's been a lot of work kind of describing how or making assumptions about how humans in crowdsourcing systems work and how we can model their behavior, but we can actually understand it and use that behavior to make adaptive algorithms that take advantage of different types of behavior. I'll be going over several examples of that today. Let me just give you an overview of my talk. I'll cover two of the projects I worked on last summer, starting with an empirical evaluation of different types of engagement in a volunteer crowdsourcing setting. I'll also show you another project where I look at different types of payments, how workers are biased, and how this compares to volunteers in a crowdsourcing task. Then finally, if we have time, I'll go over different future directions in studying interactive crowdsourcing between different groups of workers. Let me start with an example of why it's useful to study engagement in crowdsourcing. Galaxy Zoo is basically a project that started over four years ago -- actually five years ago, in 2007. 
The idea is that in the Sloan Digital Sky Survey, a telescope is pointed at a lot of different stars and has taken pictures of many of the galaxies in the universe. The idea was to take a lot of human workers and basically get a better idea of what types of galaxies exist in our universe and get a better understanding of the stars. This is about the simplest type of crowdsourcing task you can think of. In its first iteration, workers basically classified the galaxies into one of six types. They clicked one of these buttons here, and were repeatedly shown pictures of different galaxies. The idea is that once we have a majority vote on the classification of a galaxy, we have a consensus. Surprisingly, in spite of this task being really simple, people have spent a marathon amount of time on it. The longest that anyone has done this task without leaving for more than 30 minutes was 17 hours, and they went through almost 12,000 different galaxies. I'm not sure who that person was, but they are the type of crowd worker that we like to see. This leads to some interesting questions though, because we want to understand why people are willing to work for this long and why they are willing to give us their time. One way to look at this is the idea of engagement, or how productive people are going to be. How much time are they willing to spend? How much work will they produce? What do the activities of engaged workers look like? Can we distinguish between different types of workers in this system, and if we see workers in an online setting, how much longer will they continue to provide us information? How can we predict if they are actually going to quit or stop? This is just one definition of engagement which I'll cover today, but there's definitely, I think, more room for nuanced ways to take into account worker interest, such as whether they'll return in the future and things like that. Why do we care about this type of engagement? Well, the idea is that if we can model the features that allow us to predict when people are engaged, we can actually characterize what is driving participation. We can look at patterns and recognize when people are actually willing to work and when they aren't. We can estimate the amount of work that our system has produced. We can actually take a look at the workers that are interacting with our system and basically be able to estimate how much work we're doing. Are we keeping people interested or not? If we're not keeping them interested, how do we improve so that we can get more work done? More importantly, we can actually take targeted interventions or different types of interactions to try to engage those workers, such as new workers to the site that we'd like to convert into old hands or dedicated users who will provide us a lot more work over the long run. Obviously, this type of problem is a lot more important for unpaid systems, because we don't have the luxury of just giving people money in order to get them to stay. I think this problem of taking an algorithmic approach to engagement is a natural extension of work that's been done so far with applications to crowdsourcing. Ece and other collaborators at Microsoft Research have basically taken a look at decision-theoretic approaches for predicting how accurate a task is going to be and making decisions about how many workers to hire in order to achieve a desired level of accuracy. 
There have also been other decision-theoretic models for making optimal decisions in other types of crowdsourcing for accuracy, but this has kind of not been studied as much in the context of engagement and interest. There have been some types of attention models for different types of online communities, and it's natural to take this decision-theoretic or algorithmic approach in order to build a foundation for making these types of decisions. Let me just give a basic overview of the model that we'll be using to study this problem. We have different users in our system and they can basically work on tasks -- yes, go ahead. >>: Maybe my question is a little bit ignorant, but on the related work, I assume that they had some kind of models of users, crowd users, when they have this decision-theoretic model, right, otherwise how can they determine how many workers they can hire? >> Andrew Mao: Yeah, exactly. >>: Are you saying that those models are not adequately capturing the real people's engagement? >> Andrew Mao: Those models are primarily used in order to make predictions about the accuracy of tasks. I'm saying that we can, we should build similar models in order to take into account the engagement of people working in crowdsourcing. Does that make sense? >>: Those models assume that once people are assigned a task then they will finish it? >> Andrew Mao: Yeah. Then you also make some assumptions about the distribution of workers that you have. You don't take into account things as how long people are working and how many times they've come back and things like that. I think these models together in the future will naturally integrate into a much larger system. >> Ece Kamar: May I just add one thing? Those models are only reasoning about your immediate next task which will probably take some time from five seconds to a minute, but they don't really reason about your longer-term contribution to the crowdsourcing system in the next half an hour, an hour. That's where we are trying to get. >>: So those are kind of short tasks then? >> Ece Kamar: These are called micro tasks and they take very, very little amount of time to finish. >>: All right. Thank you. >> Andrew Mao: By the way, feel free to interrupt if you have any questions at any time. We'll basically take a look at a type of crowdsourcing given the tasks that I described to you, such as simple classification tasks, where there are different workers and over time, they complete different groups of these tasks. They can take different amounts of time on each task, and they might take breaks where they're not doing anything at all. They can either take short breaks or longer breaks which they'll come back in the future. We'll group these sets of tasks into different types of sessions. What I've shown you here is a contiguous session, and the idea of this is to capture groups of tasks that people do in a continuous amount of time on the site. We can further take these sessions and group them into aggregate sessions, and the idea of this is to capture amounts of time that people spend including short breaks but without completely leaving the site. This is like bathroom breaks, pizza deliverymen, and things like that. At some point in time, we basically have the problem of we know everything about the worker so far, including the task they're doing and everything in the past, and we want to make a prediction about when these workers will finish their sessions. When will they stop working? 
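To make the session grouping concrete, here is a rough sketch of how a worker's task timestamps could be split on inactivity gaps. The function and the example click stream below are illustrative rather than the actual Galaxy Zoo pipeline, and the 5-minute and 30-minute gap thresholds are just the short-break and long-break proxies discussed later in the talk.

```python
from typing import List

def split_sessions(task_times: List[float], max_gap_seconds: float) -> List[List[float]]:
    """Group a worker's time-ordered task timestamps into sessions.

    A new session starts whenever the idle gap between consecutive tasks
    exceeds max_gap_seconds. Contiguous and aggregate sessions differ
    only in the gap threshold used.
    """
    sessions: List[List[float]] = []
    current: List[float] = []
    for t in task_times:
        if current and t - current[-1] > max_gap_seconds:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Illustrative click stream (seconds since the worker's first task).
clicks = [0, 35, 80, 600, 650, 700, 7200, 7230]
contiguous = split_sessions(clicks, max_gap_seconds=5 * 60)   # short-break threshold
aggregate  = split_sessions(clicks, max_gap_seconds=30 * 60)  # long-break threshold
```

With these thresholds, the same click stream yields three contiguous sessions but only two aggregate sessions, since the ten-minute gap counts as a short break within an aggregate session.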
That prediction problem -- when will a worker stop -- is the type of problem I'll be looking at today. There are different types of hypotheses under this model, right? Naturally, as people work longer, we're going to think that they're more likely to quit, right? At the same time, we can do more nuanced measurements of what people are doing, so if we see that people are spending more and more time on each task, they might be getting less interested because they're distracted, or they may also be becoming more careful and learning how to do the task better. If we see extremely short amounts of time on each task and people basically making the same decision all the time, we might believe that they're not really paying attention. We can also say that perhaps people might work at around the same time each day, so if we know what time they're coming, we can make predictions about when we think they'll stop. Importantly, all this information is probably more powerful when we take into account past performance and what workers have done before, both for the same worker and for the group as a whole. Let's take a principled approach to solving this problem, so I'll start with some really interesting aggregate data that we've gathered from Galaxy Zoo. What I have here is a graph of the cumulative number of sessions that return after some amount of time. So given that someone has stopped working, for example, almost 40 percent of people come back after 30 minutes. You can see the different marks here: 10, 15, 30 minutes. This is a log scale on the x axis. In particular, you see some interesting patterns here in that at one day and two days, there's a noticeable spike in this graph. There's a much denser group of sessions that are restarting after 24 hours and 48 hours. If we look at what this looks like for just a two-day period, you can see that there's a distinct pattern of workers that come back after one day. This is actually a really interesting type of distribution that we're seeing here, because within one day we kind of have an exponential drop-off in the number of sessions that return, but we also have this bathtub distribution that occurs over multiple-day periods. This is saying that there are really promising ways to model this and get a better handle on what people are actually doing. We'll basically take a machine learning approach to this problem, so every time someone completes a task, what we'll do is create a training instance based on their history and everything we've seen so far. For each task that someone completed, we'll ask the question: when will this person stop? We'll use this to make a prediction about the future and compare it to how this person actually performed. The idea is that we'll have a temporally consistent approach here, so we'll use some amount of data as a training set and then use the data that is further in the future for validation and testing. This is important because we don't want to contaminate the training data with instances that came after it. We've generated a pretty large number of features for this problem, and I'll just go over some of them here and why they might be important. First, we can just take a look at what people are actually doing in each session: how many tasks they've done, how much time they've been spending. What is the average amount of time they've been spending on each task? What is the entropy of the work that they're doing? 
That might be a measure of whether they're making the same decision each time or whether they're paying attention. We can look at the tasks they're actually working on. How much do they agree with previous workers that have been doing this task? How much have their votes been changing? Finally, we can look at their history. How much have they worked for us before? How long do they typically work? What time are they coming on the site versus when they usually come? We can also look at how long it's been since they last worked for us. Notice that all these features we're focusing on are not domain dependent. They work for all general types of classification tasks. Galaxy Zoo just turns out to be a great set of data to study this problem, but the idea is that we're motivating this type of study for more general types of crowdsourcing where you can generate these features from data. It's important to define what exactly disengagement is. We have a prediction problem, and what's the label that we're trying to predict? Well, we'll focus on the binary prediction problem of whether someone's session will end within a certain number of tasks or a certain amount of time. There are still multiple ways to do this, so we can predict whether they will end for a short break or a longer break, and we've used these definitions of five minutes and thirty minutes, based on the distributions I showed you earlier, as a proxy for these types of short or longer breaks. We can also make predictions by looking at how many tasks or how much time workers will actually spend on the task. We can basically -- yes, go ahead. >>: For those minute thresholds used to define contiguous sessions and aggregate sessions, have you looked at any tasks other than the galaxy data that you collected traces from to confirm that these will hold for other types of tasks? >> Andrew Mao: No. I don't have another type of task which I can show. We kind of picked these, I guess, looking at the distribution of work on Galaxy Zoo. I think what I'm trying to show today is what we can predict from this task and how it might generalize to other tasks, so I don't have a clear answer to that question. What I'm showing you here is basically the accuracy of this prediction as measured by AUC, or area under the ROC curve. The idea of AUC is: how well is your predictor able to tell a random negative and positive instance apart? If someone will leave within 20 tasks, what percent of the time will we get a comparison of those two things right? There are a lot of graphs here, but let me try and explain what's going on. On the left, we have prediction by the end of a contiguous session and on the right, prediction by the end of an aggregate session. We can see that the ability to tell disengagement from continuing to work is actually easier as we look at aggregate sessions, because this is a more meaningful measure of when someone will actually leave. Additionally, as we increase the time interval for prediction -- if someone will leave within 30 seconds, one minute, two minutes, five minutes -- the accuracy also increases. The ability to make this prediction increases, but at the same time, if we predict that someone is going to leave within 30 minutes or, in a more extreme example, one hour or two hours, that's not a very useful prediction to make because it doesn't allow us to take any actions as a result. 
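Stepping back to the setup for a moment: roughly, each completed task becomes one labeled training instance. Below is a minimal sketch of that labeling and of the temporally consistent split, assuming each instance carries whatever session, task, and user-history features are available; the names and structures are illustrative, not the actual system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TaskEvent:
    user_id: str
    timestamp: float            # when the task was completed
    session_id: int             # aggregate session this task belongs to
    features: Dict[str, float]  # session / task / user-history features

def label_instances(events: List[TaskEvent], k: int = 20) -> List[Tuple[Dict[str, float], int]]:
    """One instance per completed task: label 1 if the worker's
    aggregate session ends within the next k tasks, else 0.
    Assumes events are already in time order within each session."""
    instances = []
    by_session: Dict[Tuple[str, int], List[TaskEvent]] = {}
    for e in events:
        by_session.setdefault((e.user_id, e.session_id), []).append(e)
    for session in by_session.values():
        n = len(session)
        for i, e in enumerate(session):
            tasks_remaining = n - 1 - i
            label = 1 if tasks_remaining < k else 0
            instances.append((e.features, label))
    return instances

def temporal_split(events: List[TaskEvent], cutoff: float):
    """Train on everything before the cutoff time, test on everything after,
    so the training set is never contaminated by later data. For simplicity,
    sessions straddling the cutoff are just truncated."""
    train = [e for e in events if e.timestamp < cutoff]
    test = [e for e in events if e.timestamp >= cutoff]
    return label_instances(train), label_instances(test)
```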
So what we've basically done is chosen a label where we look at whether the aggregate session will end within 20 tasks, because this label actually has the highest correlation with all the other labels, and it's a meaningful tradeoff between the usefulness of the prediction and the accuracy. I can give you an idea of what these types of models look like when we compare them to each other. On this axis, we're predicting the probability that the session ends in five tasks, and this axis is the probability that it ends in 50 tasks. The blue squares are the training instances which actually ended within 50 tasks. As you can see, these models are actually predicting pretty correlated values. They're not exactly correlated, but they're very similar if you transform this function into the probability that people will actually leave their session. A natural question to ask is, given that we are making this prediction for the entire site as a whole, how do the different sets of features matter? I described to you different types of features, such as the task features, the session features, and the user features. As we add more features, including the user history and the session features, the accuracy of our prediction increases. But also, we might be concerned about different subpopulations on the site. If we take a look at this model and apply it to segments of the population, such as only those users with a low amount of history or the users with a high amount of history, we see that it actually makes better predictions given more history, and of course, that makes sense. Additionally, when we take into account the user features for people that don't have much history, the improvement is much less than when we have more history about the users. Another question we'd like to ask, though, given what we see here, is does it make sense to train models for specific subpopulations on the site? For example, if we're interested in targeting only users with a low amount of history, can we actually get better performance if we train a model just for those users? The answer that we've seen from our data is that the general model is rich enough to capture predictions about the entire site of users as a whole. What we see here is that if we train a model that is specific to users with lots of history, it actually doesn't do any better than the general model we've trained on the entire population. This is a similar result for users with low amounts of history. This is particularly interesting because usually we want to have a rich enough model to capture information about users on the entire site, and this is saying that if, for example, we want to make predictions just about the users with little history, because they're kind of new and we want to convert them to more dedicated users, we can use a more general model. Yeah, go ahead. >>: [indiscernible] >> Andrew Mao: What I've basically taken here is the top and bottom quartile of users by how much work they've done in the past. >>: What does the situation look like if they're new people, versus if they have a rich history? >> Andrew Mao: These are basically 25 percent of the users, the top 25 percent and the bottom 25 percent of users, and basically all the instances that were generated by those users. We use those in a real -- one of the motivations of this model is to be able to use it in an actual crowdsourcing system. 
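As a hedged sketch of that general-versus-specialized comparison: split workers into low- and high-history groups (here by quartile of past work), train one model on everyone and one per group, and compare AUC on each group's held-out instances. This assumes numpy arrays of features, binary disengagement labels, and each instance's past-work count; the logistic regression learner is just a placeholder, since the talk doesn't specify which classifier was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_general_vs_specialized(X_train, y_train, history_train,
                                   X_test, y_test, history_test,
                                   low_q=0.25, high_q=0.75):
    """Compare a model trained on all workers against models trained
    only on the bottom/top history quartiles, evaluated per group."""
    general = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    lo_cut, hi_cut = np.quantile(history_train, [low_q, high_q])

    results = {}
    for name, train_mask, test_mask in [
        ("low_history", history_train <= lo_cut, history_test <= lo_cut),
        ("high_history", history_train >= hi_cut, history_test >= hi_cut),
    ]:
        specialized = LogisticRegression(max_iter=1000).fit(
            X_train[train_mask], y_train[train_mask])
        results[name] = {
            "general_auc": roc_auc_score(
                y_test[test_mask], general.predict_proba(X_test[test_mask])[:, 1]),
            "specialized_auc": roc_auc_score(
                y_test[test_mask], specialized.predict_proba(X_test[test_mask])[:, 1]),
        }
    return results
```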
Using it in a real system means that if we start something up and begin asking people to do work, how much data do we need to achieve this level of prediction accuracy? One way to test this is under the temporal model I showed you before. You basically train with increasing amounts of data and see how long it takes to converge to the optimal level of accuracy. If we take a look at different amounts of training, such as one day, two days, three days, a week, two weeks, and a month, we can see that there's actually a lot of data here, so much that it takes about a month of data before we reach the same level of performance as the general model. This is saying that if you're starting a crowdsourcing site fresh with users that you've never seen before, you should keep retraining your model, in this case for about a month, until you reach the same level of performance. Another question we might want to ask is how well this model generalizes over time. Given the amount of data that we have, how much variation is there in how well we can predict the engagement of users? In this case, what we'll do is take different amounts of training data and test over different segments of the future to look at how well the model makes predictions. What we see here is that the performance of the model actually does not change that much over time. It's slightly dependent on the set of data that is generated by the users, but there is no clear upward or downward trend; it's not getting more or less accurate. Given that we've reached convergence of the performance in this model, it's not really necessary to be constantly retraining it in the future. What are some of the most important features in this model for making these predictions? As you would naturally expect, looking at the number of tasks that people have done in their previous sessions and also in their more recent sessions, for making those comparisons, is important, but also looking at the amount of work they've done in their current session and comparing that to the median of the work that they've done in the past allows us to make these predictions about whether someone will leave or not. How long they've been a user on the site is also an important feature, and it kind of allows our learner to distinguish between different groups of users on the site, as you saw in the previous slides. What I've shown you here today in this project is basically an attempt to measure engagement in a real system. What we learned is that using non-domain-specific features, we have a meaningful statistic for targeting users that are about to leave. So if we look at the top five percent of our predictions by probability, 54 percent of the time those users will actually leave their session. This model is rich enough to capture different subpopulations on the site, and I've shown you which different groups of features are important for predictions. I'm not making a general claim here, but more that it's important to study the different types of features, such as those in the session or the user history or the task itself, because in different domains those might have different levels of importance. I've also shown the amount of data you need to converge to optimality and how well the model generalizes to future time periods. The future work that we're particularly interested in for this task is to take this model of engagement and test interventions out on the users. 
This means doing things like making the users more interested, getting them to come back in the future, and basically converting the workers that are just visiting us into more dedicated users, because that will increase the amount of work that we can ultimately get done as an unpaid crowdsourcing system. Some of the ideas here are, for example, showing a tutorial for the task to people that you think might be confused and disinterested, or showing them particularly interesting tasks that will keep them interested. There are other things too: if we think that a user is doing a very bad job and we want them to leave, we might be able to increase the probability of that happening by nudging them to do something else, or even increasing the cognitive load that it takes them to do this task. These might be a little bit dystopian, but they could be a way that an algorithm can optimize the amount of work that's done in a crowdsourcing system. I'm going to talk about another project next. Does anyone have any questions so far? >>: Have you heard of Fold It? >> Andrew Mao: Yes. >>: They're using a game basically to crowdsource. Do you know about any kind of stats that they have released, like the engagement of people? If you have a game, does it change any of your findings? >> Andrew Mao: I can talk a little bit about that later, but this is, I guess, just based on Galaxy Zoo, which is trying to study a classification task. I think that gamification and different types of interaction could be really helpful for increasing engagement in crowdsourcing, and that's one area I definitely want to study. The question I want to ask next is: what is the right way to pay workers in crowdsourcing? If you're familiar with Amazon Mechanical Turk, you'll know that there's basically a canonical way of paying workers for each task that they do, and we either accept or reject the work. Is this the only way to compensate workers for their work? I'm arguing that this is definitely not the only option. The particular payment system that we're adopting here, paying workers for one task at a time -- how does this affect them? It's important to study this question and look at other ways you may be able to compensate workers in paid crowdsourcing. Another important question is how the workers that we're paying compare to workers that aren't paid at all; that is, those in volunteer settings. I am going to attempt to give a partial answer to that question today with a really interesting study we've done using a crowdsourcing task. I'll adopt a similar model to what I showed you before, but we'll basically study the context of one worker. What we basically have here is a needle-in-a-haystack task. Workers are looking at these tasks over the course of a session, and each task is basically a haystack with some needles we're interested in finding. Some things look like needles but aren't needles. We basically have some things that we want workers to find and some things that may be misleading and that we actually want people to ignore. What we want to do is incentivize people to give us good accuracy on this task, to limit the false negative and false positive rates, but we need a realistic task to study this problem, and we need ground truth data in order to look at what the actual accuracy of workers is. What's the right way to pay people for this type of task? I'm going to focus in particular today on methods that aren't performance dependent. 
We want to study what types of biases are created by different types of payment when we don't actually have ground truth or we aren't using gold-standard data to evaluate the performance of workers. So in addition to the typical way of paying people for each task, there are other ways: we might pay them standard wages, as in normal economics, or, in this type of needle-in-the-haystack task, we can pay for each annotation that people make, so each time they mark something we make a payment. Or they might not be paid at all. So if we have a similar type of task in a volunteer system, what type of accuracy and characteristics can we expect from the workers there? Basically, if we're paying people in these different ways, the theory in economics predicts that people will respond accordingly to the incentives in the payment. Even if they're not completely economically rational, there are still going to be effects based on how they're getting paid. For these payment methods I'm showing you here, we'd like to study what different effects they cause on workers. The prior work in this literature is pretty interesting. People have studied not how different payment methods affect people but the amount of payment, and the consensus from several studies is that the amount of pay actually doesn't really affect the quality of the work. It can affect the quantity of the work done and the speed of the work, but in general the quality doesn't change very much. Additionally, there's kind of an anchoring effect for the level of payment, so if you're paying workers a given amount, they always think they deserve a little bit more. There's also been a lot of work looking at different types of incentives for getting people to give good quality work, and a lot of those focus on performance-dependent pay. So if you have some gold-standard data, and you're compensating workers according to that, how much does their behavior change? Unlike that prior work, we're actually looking at how we can incentivize people, and the biases they develop, when the pay is not performance dependent. A lot of this is, I guess, based on work in economics about how people respond to different types of incentives. The task that we're going to study here is finding planets in other solar systems using the transit method. There are these telescopes in orbit above Earth that are basically pointed at a bunch of different stars, and they measure the brightness of the stars over time. The idea is we can actually find planets in distant solar systems by looking at the brightness of the star. Here, you see a planet in orbit around the star, and when it passes in front of the star and it's in the same plane as the telescope that is viewing it, it causes the brightness of the star to dip. Basically, by looking at these dips, we can find planets in other solar systems. You can imagine that there are some nuances to this task. There's actually a Zooniverse project called Planet Hunters where workers are asked to find these possible planets by looking at what is called a light curve, which I just showed you, which is the measure of the brightness of a star over time. This is a pretty general interface for looking for these objects, where you can zoom in, you can scroll around, you can draw boxes to mark different transits. What I have here is a particular example of what might be a planet transit in this data. 
The idea of planet transits is particularly interesting because you can actually simulate them with mathematical functions. We can basically take some data that we've collected that has no planets, because the vast majority of the stars don't have planets around them when we take this light curve data, and we can add a fake planet to this light curve. So we just imagine that there's a planet orbiting around this star with some orbit and distance, some period, and some size, and we can subtract this from the brightness of the star and basically get some fake planets. This creates interesting ground truth data that's very similar to the real data for this task, and the same approach is also used by astronomers. You can imagine that, with the ability to do this, we can actually vary the difficulty of the task. If we take a really big planet and put it in front of a small star, we get what is a really easy-to-detect transit, because compared to the variance in the actual brightness measured by the telescope, this is a really big dip to spot. But we can also get really hard transits as well. We take a small planet, put it in front of a big star, the brightness changes a lot less, and especially if that planet is going very fast, the width of the dip gets much smaller. This actually gives us a great test bed for looking at how financial incentives affect workers when they're working on tasks at varying levels of difficulty. The experiment we set up is kind of like a replica of Planet Hunters using Amazon Mechanical Turk. We'll give people a group of light curves, and they can draw different boxes wherever they think they see transits in a light curve. There's a counter at the top of their screen that shows how they're getting paid, and it updates in real time. As they work through the task under whatever incentive method they're assigned, we'll show their payment and update it constantly. They can draw any number of transits here, subject to some limits which I'll discuss, and then they can either click next to continue to another task or submit their work and fill out an exit survey. The point of this experiment is to compare different incentive methods at the same average amount of payment. So if we're actually paying people the same amount of money but in different ways, how do they respond to that? We can vary the level of difficulty in this task, and we actually have the same tasks that were done by volunteers in the unpaid Planet Hunters crowdsourcing system. Additionally, when people finish their work on a series of light curves here, we can ask them why they stopped. What made you choose to stop doing our task and possibly go to something else? The design of this experiment actually comes in three phases. First, we looked at how the volunteers behave in the unpaid setting. We see that they work at a certain rate in tasks per minute, and across all the tasks that we're looking at, they make some number of annotations, which are the transits that they've marked. Then, assuming that people in a paid crowdsourcing system, Amazon Mechanical Turk, behave the same way, we calculate the wages that we'll pay them such that they'll make $4.80 per hour. The reason that we picked this number, $4.80, is that if you're familiar with the Mechanical Turk ecosystem, workers try to target a wage of $6 per hour, and that's kind of an acceptable rate at which they'll work an entire day on a given task. 
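Before getting further into payment, here is a minimal sketch of the transit-injection idea just described. Real injections use proper transit-shape models, as the astronomers do, so the box-shaped dip, the parameter names, and the numbers below are purely illustrative.

```python
import numpy as np

def inject_transit(t, flux, period, duration, depth, phase=0.0):
    """Subtract a periodic box-shaped dip from a planet-free light curve.

    depth is the fractional drop in brightness (a bigger planet relative
    to the star means a deeper, easier-to-spot dip); duration is the width
    of the dip in the same units as t (a faster planet means a narrower dip).
    """
    in_transit = ((t - phase) % period) < duration
    dimmed = flux.copy()
    dimmed[in_transit] -= depth
    return dimmed

# A fake "star": roughly constant brightness plus telescope noise, over 30 days.
days = np.linspace(0.0, 30.0, 3000)
flux = 1.0 + 0.001 * np.random.randn(days.size)

easy = inject_transit(days, flux, period=5.0, duration=0.3, depth=0.01)     # big, slow planet
hard = inject_transit(days, flux, period=5.0, duration=0.08, depth=0.0005)  # small, fast planet
```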
We basically want to set a wage that's slightly lower than that $6 target, so at some point they'll be compelled to leave, whether due to disinterest or whatever other reason. Given that we've calculated these wages, people will all earn $4.80 in all the different treatments as long as their behavior doesn't change, but of course, we do expect it to change, right? These financial incentives will definitely affect them. So what we'll do is actually scale the payments that we're making so that the workers are actually earning around the same amount of money. And the reason that we can do this, as I told you before, is that the prior work on financial incentives shows that behavior in terms of quality and some other features doesn't actually change that much based on the amount of payment, so we can account for the biases here: we can just scale everything accordingly and people will get paid the same amount. What are some hypotheses about the biases that workers might exhibit when we compare this to the unpaid setting? We're looking at how accurately they're able to detect planet transits in these different light curves, and at what economic theory predicts about how the incentives bias that. This is kind of an ambiguous task, right, because there's a lot of noise in the light curve and some planets are big and some planets are small. Even if you're not explicitly cheating at the task, you might actually think that you found something and be more lax in your standards for your work. If we think about these incentives: if we're paying everyone for each task that they do, as we normally do on Mechanical Turk, we might think that the recall is going to go down because they're going to be less careful about looking for things. They want to finish as many tasks as possible, so they won't look as carefully for transits, the number of tasks will go up, and they'll spend less time on each task. If we're paying them for each transit they mark, without regard to whether it's right or not, we can think that the precision is going to go down, because of all the transits that are marked, fewer of them are actually going to be the needles that we're interested in finding. The recall might go up though, because they're going to mark as many things as possible, and they might also do an increasing number of tasks in order to make more money. Finally, if we're paying them just by wages, it's not clear what type of bias we'll see here, but they may just spend a really long amount of time in the session and maybe even work really slowly, because their pay is not really dependent on how fast they're producing results. Because all these incentives actually give economically rational agents some pathological behavior, we've put some pretty simple but very generous controls on the experiment. The idea of these controls is to limit the type of crazy behavior we would see from people that are trying to make as much money as possible. Because of these different treatments we're running, we've put these controls on every experiment: people had to spend at least five seconds looking at each light curve, they could only draw a maximum of eight boxes on any given light curve, and if they didn't do anything for two minutes, based on detecting clicks and different events in the browser, we would just give them a warning, and then at three minutes kick them out. 
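The rate-setting behind those wages is simple arithmetic. Here is a sketch with made-up rates rather than the actual Planet Hunters numbers: observe the baseline tasks per hour and annotations per task, price each unit so that baseline behavior earns the $4.80-per-hour target, and then rescale in the second phase once the observed paid-worker earnings drift away from that target.

```python
TARGET_WAGE = 4.80  # dollars per hour, just under the ~$6/hour Turk norm

def rates_from_behavior(tasks_per_hour: float, annotations_per_task: float):
    """Per-task and per-annotation prices that pay TARGET_WAGE per hour
    if workers behave like the observed (volunteer) baseline."""
    pay_per_task = TARGET_WAGE / tasks_per_hour
    pay_per_annotation = pay_per_task / annotations_per_task
    return pay_per_task, pay_per_annotation

# Phase 1: price from volunteer behavior (the rates here are illustrative).
per_task, per_annotation = rates_from_behavior(tasks_per_hour=60,
                                               annotations_per_task=1.5)

# Phase 2: if paid workers speed up and actually earn, say, $8.20/hour
# under per-task pay, scale that treatment's rate back toward the target.
observed_hourly = 8.20
per_task_rescaled = per_task * TARGET_WAGE / observed_hourly
```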
The controls I just described applied over the entire experiment, but they were mainly geared toward the type of pathological behavior that you might see in these treatments. They're designed to be generous enough that only the most pathological workers would actually hit them, and we selected our data such that you wouldn't actually be able to hit any of these limits, particularly the cap on the number of annotations, just by working on the task. Then we also limited the overall amount of time in the task to an hour or 200 light curves, which is basically more than enough for what we need to study. Let me just show you what happened when we ran the first phase of this experiment. Assuming that everyone was behaving the same way, they should have all earned the same amount of money, so of course, the people that were being paid by time earned $4.80 an hour. Does anyone want to predict how much the people that were being paid by task were earning? How much did their behavior change to increase the amount of time they were working? Any suggestions, in terms of an aggregate average hourly wage for the workers in that treatment? >>: Was there any additional friction to having to go get new tasks? >> Andrew Mao: No. What do you mean? No, there was none. All the tasks were in one Mechanical Turk HIT. Basically we used that as the conduit for our experiment. There was basically no friction. You just click the next button and you get another one, so you can click that as much as you want every five seconds. >>: Are you asking for the mean or the median, because presumably they'd be different? >> Andrew Mao: This is kind of like the weighted mean: how much we paid people for the work they did, divided by the total amount of time people spent. >>: That would be the mean. >> Andrew Mao: Yeah. The amount of money people were making when they were getting paid per task was over $8 an hour. >>: Presumably, a few outliers are going to really shift the mean. >> Andrew Mao: This is basically weighted by the amount of time that people spent, so the workers that spent only a small amount of time don't count much toward this. It's basically the people that did the majority of the work whose earnings we're looking at. Does that make sense? >>: Still, a few outliers are really going to mess this up, so the median is going to be -- >> Andrew Mao: I think you kind of have to weight it by how much people did, right, because if you just take a look at the raw numbers, people worked on the task for different amounts of time, and you can't take the median of that, I think. It's weighted by how much work the people did. So how much in aggregate did we pay people who were working under this method? >>: As far as knowing whether this would hold up to any kind of statistical significance test, this could be two workers who are the result of this shift, right? >> Andrew Mao: We limited the amount of time for each session, so no particular worker is going to account for a huge amount of this. >>: What statistical test would you be running on this? >> Andrew Mao: We actually didn't run a statistical test on this. >>: We don't know if there's a real difference here. >> Andrew Mao: I guess, but I'll show you some other results, and maybe -- I mean that's not the point of what I was trying to show here, but I can -- let me go forward a little bit and see if I can answer your question. 
>>: I'm interested in your paying by time, where you then weight by the number of tasks. Like the $4.80, is this just the average of all these people? >> Andrew Mao: We basically take the amount of money we paid to workers in this treatment divided by the amount of time they spent on this treatment, which is basically weighted. We are paying them this wage, right, but in the other treatments people are behaving differently. If you wager a guess at the amount of money we're paying when people are working by annotation, you can see a clear bias here. They are making almost $11 an hour, and some workers, like you said, the outliers -- one person actually made over $20 an hour doing our task on Mechanical Turk. That might be around the highest amount of money anyone has made as an hourly wage on Mechanical Turk. What we did in the second phase of our experiment is scale down all these wages for the other treatments so that people were getting paid the same amount. We basically took the wage that we were paying people and divided it so that everything would be around $4.80. All the workers that we used for each treatment were unique, so assuming that we're drawing from the same distribution with a large enough amount of data, these wages should be similar, right? What we see is that yes, once we've scaled the wages, we were actually able to get the treatments to earn within a dollar of each other, and in particular, this number is extremely close. The number of workers I actually have in here is almost 200 per treatment, and because we've limited the amount of time they're able to work in each session, I don't think that the outliers are contributing a lot to each of these payment levels. We wanted to get some treatments where people were actually earning the same amount of money so we could make an apples-to-apples comparison of how accurate the work they're doing is. >>: What's the N for each number? It's like over 1,000 or? >> Andrew Mao: The number of workers in each of these is, sorry, 180, and the number of light curves is about 25,000. Sorry, I should have put those numbers in here. Let's take a look at what happens to the precision, or basically the percent of transits marked that are actually correct in each of these treatments. So we're taking the measure of precision as: if you marked a box, is the center of the box actually in the transit? This is the same thing that the scientists in Planet Hunters used to evaluate correctness, so we'll just use the same measure as them. What we see is that when we pay people by time -- and these differences I show you here, using a parity test, are very significant, less than 10 to the -4 -- their precision actually increases a lot compared to the people that are being paid by annotation. This is something that we'd expect, because the people that are being paid by annotation are trying to mark as many things as possible. Surprisingly, we also see that the precision goes down as the difficulty increases much faster for the per-task treatment than for the other treatments. Another interesting thing to note here is that the precision does not decrease as quickly for the volunteers, and I think there are two interesting reasons why that might be. 
One is simply that we're paying people on this task. Given that the volunteer setting is the only treatment where people are not being paid and all the other ones are paid, the fact that we're paying people makes workers, for example on Mechanical Turk, value their time more. They're probably going to be a little less careful about choosing what is right as things get more difficult. But also, on the volunteer crowdsourcing site Planet Hunters, the workers were a lot more heterogeneous. For the paid treatments, we took fresh workers, gave them a tutorial for this task, explained what they had to do, and they all had the same amount of experience. What's different in the volunteer system is that the workers might vary a lot in skill level, so given that you've marked something that was very hard to spot, it's more likely you'd be correct relative to these other treatments. If we look at the recall at different levels of difficulty, there's a general trend here that you see: as things get harder, we actually get worse recall. But the differences here are significant if you take out the overall downward trend, so the wage treatment is actually doing better than the per-task treatment on recall at all levels of difficulty. We also see that at really high levels of difficulty, paying by annotation has the highest recall of all the paid treatments. In general, if we're doing work where we're paying people by task, as on Amazon Mechanical Turk, you may not want to pay people for each task if you want the best recall possible, because as the task gets harder, you get very low levels of accuracy relative to how much you're paying people -- remember, in all these paid treatments, the mean weighted average wage is actually the same. So what's one reason you might actually want to pay people per task? Well, if we look at the characteristics of the sessions and the behavior of workers, the number of tasks they work on in a session is statistically significantly higher than when we're paying by annotation, and they are actually the fastest; they work on the tasks most quickly when we're paying them by task. So if you have an algorithm that is trying to recruit workers for a certain type of task, and you need your results fast, and you are less concerned about accuracy, then paying by task is actually a good way to get those results done faster, because it increases the throughput, basically at the expense of the types of accuracy I showed you before. At the same time, if you're willing to pay people wages, at the same hourly earnings they're willing to spend more time working on your task and more time per task. So I think this is one reason why we saw the high levels of accuracy when we were paying people wages, but the actual throughput of tasks is slower. Anyone have any questions so far? The final thing that we studied is why people finished working on the task. Given the comments that they provided as they finished, we looked at the reasons they gave for actually finishing. I think the one thing that was particularly surprising to me as I sorted these into general categories is that workers on Mechanical Turk are actually concerned about the evaluation of the work that they do. A lot of workers actually stopped working because they thought that they might be rejected or that they were doing low-quality work. There are a couple reasons for this. 
Some people thought that they were helping a science project and may not have wanted to provide bad-quality results, but a lot of workers were actually just worried that they were getting monitored somehow, maybe because of the controls we put in, and that their work might be rejected. We saw a lot of people actually say, I don't know what I was doing; I'm going to stop now. We also saw some people that had hard limits on the amount of time they can work on Mechanical Turk, and there's basically no way that we can increase their engagement. For example, there seemed to be some turkers that actually do these tasks at their real jobs during their lunch hour. So this person said, my lunch break is over; I have to go back to work; I can't work on this task anymore. We also saw a lot of people that were interrupted by much shorter types of things, such as, I had to go to the bathroom and then your task timed out. Then one particularly interesting anecdote is that the way we pay people may actually keep them interested in different ways. This is one person that was getting paid an hourly wage, and his money was going up as he was working, but he thought that he would have been more interested if he was paid for each task that he did -- and these workers didn't know about the other treatments when they were in the task. It was kind of curious that this worker specifically thought he wanted to be paid in a different way. One other point I wanted to make as a result of running this study, which some of you may or may not be aware of, is the metabehavior of workers in paid crowdsourcing systems. When we look at how people behave on Amazon Mechanical Turk, there's clear evidence -- both from our studies, where people ended tasks because they weren't sure of the quality of their work, and from other observations on forums -- that workers will actually test your task before committing to do a really large amount of work for it. They are basically gauging the amount of effort they put in against how much they think you'll monitor them and correct their work. This might seem like workers are smarter than what we want to deal with, but it actually says that if we design the right type of controls for our tasks, we can use this type of self-policing behavior in order to get decent results even when we're giving them non-performance-dependent pay. Just the perception of monitoring, at least over the short term, can get them to produce good-quality work, because they don't know how lenient the requester is going to be or how much they'll get evaluated on their work. Additionally, if you run experiments on Mechanical Turk or run similar types of tasks, you'll see people talking a lot about them on other forums, so this is just something to be aware of. They may talk. In the initial phases of our experiment, we noticed that people would actually talk about how they were getting paid, and we had to do several things to control for that in the second phase of our experiment. You should be aware that people will discuss your task on outside sources such as Reddit and the mturk forums. If you have ever run a study on Mechanical Turk and you Google the name of your study, you'll probably see lots of threads talking about it, especially if it was a popular task. What have we learned here? 
Different types of incentive schemes at the same level of payment create biases, so we can actually trade off different types of accuracy, such as precision and recall, against the speed of the work, and use that as a control for different types of algorithms. Given that you're trying to obtain a certain amount of work in crowdsourcing, it's important to consider the right way to pay people, not just how much to pay them. What I see for the future is actually being able to use this as part of crowdsourcing algorithms, to tweak the incentives for different users in order to get the type of results that are beneficial for the algorithm. We also show that, at least on this type of task and maybe some others, paid workers can be as good as volunteer workers. And this is an important comparison to make, because the domains of paid and unpaid crowdsourcing have until now been pretty separate and not really compared. It's also important to be aware of the metabehavior of workers on Mechanical Turk. Using the right design for your task and the right controls can lead to a big difference in the accuracy of the results. In some ways, this work actually raises more questions than it answers. Because there are just so many different ways to combine the financial and intrinsic incentives for workers, we have a lot of room to go in creating models that will actually combine these different types of incentives to predict how workers will behave. So another thing we can do is actually quantify, for example, at both different levels of payment and different methods of payment, what speed-versus-accuracy tradeoff we can get. We've seen that as we pay workers more, they may spend more time on the task or maybe even work more quickly, because they feel like they have more control over what they're getting paid. Finally, I think it's important to actually take all these incentives that people have been studying, from both the social perspective and the financial perspective, and in the unpaid and paid settings, and build models that will actually explain how all of these affect workers' behavior. Does anyone have any questions? >>: You kind of mentioned it a little bit at the end there, but have you looked into, instead of just paying per task, paying per correct answer and seeing how that affects the amount of time that they spend on each task? >> Andrew Mao: Yeah. I think there's kind of a confounding factor there. What we are interested in studying here is how financial incentives bias people when they're not really getting paid by correctness. So you can pay per correct answer, but there are different definitions of a correct answer here. There's a correct annotation or a correct light curve, so when you mix those things in together, I'm not sure what the right question is. >>: With the light curves, you will never get the ground truth while you're there. >> Andrew Mao: Correct. >>: Actually, for this task, one interesting thing is that people, given an annotation, then they [indiscernible] sound right, they are still not confident that that's the exact answer, so they move the telescope in that direction and then start observing the planet for longer times and collect more [indiscernible] data, and then astronomers spend like a month on it, and they sometimes [indiscernible]. >>: Right, right. 
It might not be practical in actual practice, but it would be interesting to see how it stretches out the amount of time they are willing to spend. It kind of goes to the point you made about how even just telling them it might be evaluated can change the behavior, although I wonder about the observer. >> Andrew Mao: Yeah. I'm just going to conclude with a short discussion of some future directions on interactive crowdsourcing, which involves tasks that actually use people at the same time and leverage their interaction with each other. I have two examples of different ways to study this problem. Feel free to chime in with feedback. This is all work in progress at this point. Why might we want to study interactive tasks? Well, for one thing, people like interacting with others. Humans are naturally social creatures, and, for example, interacting with others might get people to be more interested. We can pay them less. In many types of social settings, people will give you their time for free, so why can't we do that for crowdsourcing as well? That's actually a big part of why unpaid crowdsourcing has worked so well. What type of tasks can we actually do, though, when we get workers to work interactively? Not just interacting through some sort of system, but interacting more directly with each other. Is this type of work better than working individually? There are several things that might happen when we get people to work collaboratively or interactively. They might be able to work faster as a group, or at least increase the throughput of the work they're doing. They might produce better accuracy by giving each other feedback and being able to check each other's work, and they might be more engaged. I think that's one of the most important factors. We can get people to do more work for less actual cost to the system. But is it worth the increased cost -- the extra money we have to pay them, for example, or any other burdens that we get from putting people together? Because we're actually using more workers, so if the task is parallelizable, perhaps it's not worth it. The related work in interactive crowdsourcing comes mainly, and originally, from the idea of games with a purpose. Luis von Ahn basically has many types of unpaid crowdsourcing tasks in human computation where different types of games provide outputs for certain interesting problems through the way that people interact with each other. And Walter has tried different types of tasks, such as being able to drive a robot as a group of people -- how to control that as a crowd -- or how to carry on a conversation with a dynamic, evolving group of users. We're also drawing some inspiration from groupware research in HCI, which is basically interfaces that facilitate and mediate different types of collaboration between people. There are actually many different types of interaction that can happen when we put people together and allow them to collaborate. So we can create games where, by virtue of interacting with each other, people are having more fun than if they were just doing a straight crowdsourcing task. We can create shared context, such as ways to chat with each other or share information, that will basically allow them to have better options for moving toward whatever problem they're trying to solve. 
We can basically create atmospheres of competition or collaboration between users, so if they're playing a game, they can be competing against each other, which actually increases their effort, or they might just enjoy collaborating with other people, basically to -- sorry. Collaboration can actually make people more interested in the task itself. Then we can also create all sorts of free incentives, such as levels, badges, or leaderboards, that allow people to indirectly interact with each other and spend more effort. For example, people love gaining levels in different types of games, even though it probably means nothing in the real world. I'm not going to focus on the levels or badges too much, but I'll give you a few examples of tasks where these other mechanisms might be useful to study. If we take a look at the Planet Hunters task that we were showing you before, one way that we might be able to have people do this collaboratively is as a game. If we take a games-with-a-purpose approach, such as an ESP game style mechanism, we basically want people to match transits together. One thing that they can do is these two players are kind of drawing transits independently, and if Player Two draws a transit in the same area as Player One, with some really silly matching algorithm we can basically mark this as something that was found by both players and give them some number of points. Another way that we might study this is actually collaboratively, so this is kind of more like a concurrent editing or Reddit style interface where people can basically look for things at the same time and they can vote up or vote down each other's work. There are basically many ways to approach this task. We could also give people the ability to chat with each other and interact in other ways, so it's worthwhile to study why certain things will work better than others and what type of tasks people would like to play games on versus collaborate with each other on. I can also give a more realistic example of where we might want to study interactive crowdsourcing, and this is stemming out of my current internship this summer at Microsoft Research in New York. If we think about interaction between users at a really large scale, how do we get an extremely large number of people to work on a nonparallelizable task at the same time? What type of tasks can we solve when people are doing this? What mechanisms can we use to support this type of work? Is there a better way to organize workers than just putting them all in the same place? There's actually a real motivation for this problem, and that comes from the idea of crisis mapping. For example, when the earthquake hit Haiti two years ago, there were nonprofit organizations that basically enlisted a very large number of volunteers to filter through all the news and tweets and other sources of information coming in to create a map of problem areas on the ground where either first responders or NGOs can respond. This includes things like bridges that are destroyed or fires or people that need food or collapsed buildings, etc., but the problem is that this is actually a humanitarian project with real needs and it's not parallelizable, so you can't really just take a bunch of workers and ask them to do this separately, because they're trying to build a combined output, which is this map of different events, at the same time. The problem is that this task is actually done in a really ad hoc way right now.
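To make the two-player matching idea above concrete, here is a minimal sketch of the kind of "really silly matching algorithm" described for the transit game: each player's annotation is treated as a time interval on the light curve, and two annotations count as a joint find if they overlap enough. The interval representation, the overlap threshold, and all the names here are assumptions for illustration, not details of the actual game.

```python
# Hypothetical sketch of the simple agreement check for the two-player
# transit-marking game. Annotations are modeled as (start, end) intervals on
# the light curve's time axis; the 50% overlap threshold is an assumption.

def overlap_fraction(a, b):
    """Fraction of the shorter interval covered by the intersection of a and b."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if end <= start:
        return 0.0
    shorter = min(a[1] - a[0], b[1] - b[0])
    return (end - start) / shorter

def matched_transits(player_one, player_two, threshold=0.5):
    """Return pairs of annotations that both players marked in roughly the same area."""
    matches = []
    for a in player_one:
        for b in player_two:
            if overlap_fraction(a, b) >= threshold:
                matches.append((a, b))
    return matches

# Example: both players marked a transit near day 12-14, so it counts as found
# by both and each player could be awarded points for it.
p1 = [(12.0, 14.0), (30.5, 31.0)]
p2 = [(12.3, 13.8), (45.0, 46.0)]
print(matched_transits(p1, p2))  # [((12.0, 14.0), (12.3, 13.8))]
```

A real game would also need scoring rules and a way to handle a player marking the same region twice, but the core agreement check could plausibly stay this simple.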
So for example, the humanitarian organization that does this actually uses four different things in a deployment to solve this problem. They have a collaborative map they're using; they use a Ning platform as a social network; they open up a bajillion Skype chats; and they use a collaborative document system by a company which shall remain unnamed. There's a real motivation to solve this problem effectively, right, because this is an important problem that has humanitarian implications, but it's not really optimal how this is crowdsourced right now. Because there are actually so many of these crises happening, such as political unrest in different countries like Turkey and Egypt and natural disasters such as hurricanes and earthquakes, and even, for example, New York being hit by Hurricane Sandy, the fact that this type of organization is ad hoc and that the centralized organizations for crisis mapping are actually overwhelmed in terms of volunteers means that crowdsourcing could be a really effective way to solve this problem. Being able to take untrained volunteers from the Internet and have them efficiently work on this type of problem would allow it to scale much better. This is actually an interesting problem for testing approaches to human computation, because we can actually simulate different types of crises, including their news sources, and basically see how well workers do under different scenarios with different mechanisms compared to the real thing. What I've been working on this summer is basically a combined interface that has a two-pronged approach. First, we can create a platform that allows crisis mapping to happen in real time and basically eliminates all the different platforms that people are using, but also, more importantly, it allows us to study the research question of how people should organize in a crisis mapping -- in a complicated task that involves many people at once. Let me see if I can just do a little demo here. Basically this is a -- sorry. What we basically have here is an interface where people are able to chat in real time with each other. They can kind of create different chat rooms and potentially specialize in different functions of this system. They can edit a document together collaboratively in real time, and they can basically develop this catalog of events that are happening in response to news. For example, you can see that we can actually edit this, change something, and save it, and it'll appear to other users at the same time. There's also a map of all the different events that are happening. There's no Internet, sorry. This is a -- yeah, sorry. I'll just go back to the presentation. So people have these chat rooms over here, where they can create and delete rooms and kind of talk to different users at the same time. They can edit things together, and they can basically fill in this information as different news items are aggregated. There's a map where they can basically see all the information that's being generated, and you can show this to people, to first responders, in a real crisis situation. Why is this problem important? There are actually many questions that I hope to answer from this. For example, how should people be organized in this type of problem? Should they -- who should talk to whom? How many groups should we create? For example, if you think of the number of chats that people should be doing, how do you scale that? Does the hierarchy matter?
Do you need leaders in this type of problem? What happens when you have small numbers of workers versus a very large number of workers? What types of communication are effective? Should we have private communication or just chat rooms or other ways to leave contacts for each other? One thing that we see in the actual crisis mapping task is that people specialize. For example, one group of people will just focus on doing the geographic information, so they'll kind of get really familiar with the terrain and be able to locate things very quickly on a map. They basically specialize in doing that type of geographic information. Does this type of specialization carry over to other things, such as filtering the data and categorizing the information that's coming in? As we kind of understand this problem better, we can try different things to study what effects are positive for getting more work done and basically optimize this process of interactive collaboration. How should we get people to organize? Should we appoint leaders? How should they talk to each other? This is a hard problem to study, but it's also very interesting and important because it allows us to use the Internet to do tasks that were never possible before. >>: Wikipedia already has lots of people editing the same document, and they have their own organization of people to approve things and so on. Also fish tank is another kind of crowdsourced annotation system that's been going on for a while, and they also have their own way to maintain this kind of crowd-based relation system, so why is this so different? Why can you -- >> Andrew Mao: This is kind of a -- one thing that happens in Wikipedia is that it happens pretty asynchronously, right. You don't have to be there all at the same time, and you can kind of respond whenever you feel like it, but this type of task is actually time dependent and people work on it in real time. So they communicate with each other in real time, and you're kind of under some pressure to produce results in like a day or two, in real time as things happen. So it's much more important, I think, to optimize the process by which people are working together here. Does that make sense? Just to recap where I think all this research is going, this is my view of basically a smart crowdsourcing system that we'll build, hopefully within the next decade. We have different types of tasks that we want people to work on, and we have a large collection of workers that are available. What we want to do is assign workers to tasks in an intelligent way, so that means some workers may work in groups. Some workers may work independently, and we kind of have an idea of what workers are good at and how long they'll keep producing work. As people are working on these tasks, whether in groups or individually, we'll kind of instrument them and look at how interested they are and how well they're doing, and we'll actually make decisions, based on the productivity that we observe, about how we should pay them and whether we should change the way -- whether we should basically give them other things, such as changing their payment or changing their interface, to make them work more efficiently. This is basically a loop that feeds into itself and optimizes a crowdsourcing task or a group of crowdsourcing tasks for an entire system in real time. These are some important questions for the future of crowdsourcing, right.
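As a rough illustration of the assign-observe-adapt loop just described, here is a minimal sketch: workers are assigned to tasks, the system watches the quality and activity of the work they produce, and it adjusts incentives when estimated engagement drops. Every name, update rule, and threshold here is an assumption made up for illustration; this is not a description of any real system from the talk.

```python
# Toy sketch of a closed-loop crowdsourcing controller: assign, observe, adapt.
# All fields, weights, and thresholds below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Worker:
    worker_id: str
    engagement: float = 1.0   # running estimate of how interested the worker is
    quality: float = 0.5      # running estimate of how accurate their work is
    incentive: str = "per_task"

def assign(workers, tasks):
    """Naive policy: give the open tasks to the most engaged workers first."""
    ranked = sorted(workers, key=lambda w: w.engagement, reverse=True)
    return list(zip(ranked, tasks))

def observe_and_adapt(worker, observed_quality, observed_activity):
    """Update the estimates from observed work, then pick a simple intervention."""
    worker.quality = 0.8 * worker.quality + 0.2 * observed_quality
    worker.engagement = 0.8 * worker.engagement + 0.2 * observed_activity
    if worker.engagement < 0.3:
        # e.g., switch the payment scheme or offer a more social, interactive interface
        worker.incentive = "quota_bonus"

# One pass through the loop; a real system would repeat this continuously.
workers = [Worker("w1"), Worker("w2", engagement=0.2)]
for worker, task in assign(workers, ["task_a", "task_b"]):
    observe_and_adapt(worker, observed_quality=0.7, observed_activity=0.1)
```

The point of the sketch is just the shape of the loop: observations of engagement and quality feed back into assignment and incentive decisions in real time.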
We basically want to understand not just the financial incentives but how people work socially and what motivates people to provide work in crowdsourcing and human computation settings. It's important to look at both the quality and the engagement of workers, because that measures not only how good the work they produce is but also how much work they produce. In the future, I think that this work will motivate algorithms that can actually make more optimal decisions about how to allocate resources in human computation and estimate how people are influenced by different types of incentives. I think there's a particularly interesting area of study in how we can do interactive and social tasks to leverage the intrinsic interest people have in working together. Thanks. [applause] >>: One comment on the difference between this type of interactive task and Wikipedia is that Wikipedia is never reactive, right. You don't have to respond to something that just happened on Wikipedia in terms of, like you mentioned, time-sensitive [indiscernible], but that kind of interaction has to do with a crisis. Something just exploded or something just blew up. >>: I think another thing they are trying to do is give special tools to this group at a time when they really need to be very efficient. Like they are trying to produce these results in a day or two when this earthquake or this disaster happens, and maybe if they need the chat rooms, they should have some special chat rooms. Meaning the way this will go is that this design is not the final thing, but they will keep building on it with the tools [indiscernible]. >>: In a more structured way than just having this convention of opening tons of [indiscernible], I guess. >>: Yeah. Based on these observations, maybe they will try to come up with like a [indiscernible] system with their own leaders directing tasks and so forth to get more efficient. >> Andrew Mao: We don't want to just have an open-ended system. I think the idea is to kind of see how people are interacting in this task and then try different things and actually have an experiment where we can say people actually did better when they had different chat rooms and they specialized, versus when they were all in one chat room and everyone was just talking over each other. I think this is an interesting problem that motivates that kind of study and gives us the platform to do it. >>: About the payment. So how important is it for the users to understand exactly how they are getting paid? This might be general, but the thing I was thinking is you can have a mixture of these three types of payments that can be either user dependent, based on the performance, or just random so they can't strategize. >> Andrew Mao: That's a good point. In this experiment, we made it very clear to users how they were getting paid. We basically showed a banner at the top of the task that told them how they were getting paid and how much they were getting paid. But you might imagine that different users respond in different ways to different types of incentives, so some users might work really fast when you're paying them by task, but some users might be really lazy when you pay them wages. I think getting this interaction between users and incentives is an important area of future work. It's not just how we should pay each different person individually, but maybe if we're paying people in different ways, can we combine their results to produce better quality output?
Basically, if you have an algorithm, for example, that can say these people produced things with higher precision and these people produced things with higher recall, can we combine them together? >>: Is it possible that economic theory can tell us about this kind of problem in this context? >> Andrew Mao: I think in the psychology literature, they definitely have some models of how people respond to different types of incentives, and I think it's kind of similar to what we've done in crowdsourcing. But we also wanted to look at how people did in comparison when they're just self-motivated and not paid for anything. That was an interesting comparison to me. >>: In the incentive literature, there's actually a lot of work about, based on what kind of payment, what is the equilibrium of the situation? What is the behavior expected to maximize the profit of the worker? The interesting thing is these results show that they are not exactly following these best responses. For example, in the pay by time example, the equilibrium is, once you learn what type of control mechanism is happening, you would like to make sure that you are clicking on something or touching the screen every two minutes, so you kind of keep dragging things along, open another tab somewhere else, keep doing whatever you want to do, and let the task sit there. You can play with the quality control mechanisms, maximize your profit, and minimize your cost. We see the same kind of thing for the pay per task mechanism, where once they see that they can't pass it until after five seconds, they tend to go ahead and start doing that -- >>: Is the reason they're not doing that that they don't have the rationality to understand it, or do they know it? >>: Equilibrium doesn't take into account intrinsic motivation to do the task or to just do well. >> Andrew Mao: It also doesn't take into account people -- like, for example, if you get your work rejected, it's really bad, so there's a really big negative payoff if you behave completely economically rationally. So what's happening here is people are kind of balancing their desire to behave in that direction with some negative costs that are not all completely clear. >>: Maybe, to go back to what you were originally saying, the problem with a lot of these equilibria in practice is that you can't communicate the mechanism itself. >> Andrew Mao: To workers? >>: So you can imagine a much more complicated scheme than just pay per minute. They might not completely understand it, but you can communicate it, so you can still have a hidden control where you're waiting for them to take an action every so often, and if they don't ever want to find out -- or they might want to find out what that period is, the 30 seconds or two minutes or whatever it was, but to explore that, they would have to get kicked out of the task, right. It's kind of like saying figure out what can kill me; it's not worth exploring -- >>: But in addition to that, in line with the incentive literature, you'll see that people do this [indiscernible] analysis based on assuming that people lack motivation and will just maximize their profit. Even in simple game settings like the [indiscernible] game, where people can fully understand all aspects of the mechanism and there is nothing to be revealed, people are still not following.
People have different kinds of motivations, like altruism and kind of helping somebody else, or not screwing up whatever the system is, so they still don't really follow the equilibrium. That's why these experiments are critical for understanding what people do, and the theory doesn't tell us that. >>: What was the background that was shown to workers here in terms of you're helping science or you're saving the world? >> Andrew Mao: We basically showed a tutorial really similar to the Planet Hunters task itself, but we didn't tell people they were helping science. In a sense they are helping science, but we can't say you're helping us find stars, because it's kind of like a metatask, right. We're trying to study the process of crowdsourcing, which will eventually help people find more planets, but maybe not directly here. So there's just basically a consent form and, I guess I didn't show it, but a pretty long tutorial about how to use the interface to find transits. >>: The only thing Andrew wrote there was how it would apply to Planet Hunters if they know what Planet Hunters [indiscernible]. He didn't even say it's Planet Hunters. >>: Obviously that plays to the motivations. >>: I think it could be interesting to study a task that doesn't look scientific. >> Andrew Mao: You might actually see a lot worse. If you have people that were unpaid to do this task but were just interested in it versus people on Mechanical Turk that were probably completely -- I don't know. It might be hard to make that comparison, because anything that is interesting enough for people to work on without being paid is probably also interesting to people on Mechanical Turk. >>: Yeah, that's certainly a risk [laughter]. There's some work on trying to figure out motivations, and I wonder, if it's a completely pointless task, whether that biases things in the opposite direction. So you can find something that looks less like science but still looks productive, sort of like the OCR tasks, and then you have push-the-green-button type things where they're completely not doing anything helpful. They're just taking part in a test. >>: I am wondering if people would like that, where you're saying, by doing this, you are contributing to this larger goal of running this mechanism, and people would be more engaged. Being a member of the team. >>: Yes. Yeah, yeah, yeah. Yeah. >>: Thank you, Andrew. [applause]