>> Ofer Dekel: The last talk of this session will be given by Ece Kamar from Microsoft Research. The title is: Combining Human and Machine Intelligence in Large-Scale Crowdsourcing.

>> Ece Kamar: Thank you, and thank you also for the earlier advertisement for this talk. I will be diverging a little bit from the general theme today, because this talk is not only about machine learning; it is really about the synergy between machine learning approaches and decision-making approaches for solving a real-world problem. I'm a researcher in the Adaptive Systems Group, and in our group we work on developing systems that can function in a stochastically changing world. These systems need to interact with that environment and with the different actors that exist in it. Such an adaptive system needs a set of capabilities. One of the most important is making sense of the world through noisy and partial observations, and of course machine learning approaches are crucial for that. But usually this capability alone is not enough: these systems also need to act effectively in the world. For that, you need good decision-making models on top of these models of the world. These could be POMDP-like algorithms, which is what I'll be talking about today, but they could also be reinforcement learning algorithms. In our group we work on a range of domains, from Outlook systems to mobile systems to other kinds of assistance for users, but the particular domain I want to talk about today is crowdsourcing.

Other speakers today have talked about crowdsourcing, but just as a recap: in recent years, crowdsourcing has become very popular for providing programmatic access to human intelligence. Current crowdsourcing applications have one particular drawback, though: it is very difficult to manage quality. The responsibility for managing these crowdsourcing tasks and assuring quality falls on the task owners. For example, as a person going to Mechanical Turk, you need to say: I want to hire three workers for this task, I want to pay this much, and this is how I will define my consensus, my ground truth, afterwards. This is not the best way of doing things, because as a task owner going to Mechanical Turk for the first time you don't know which parameters are the best ones to use. It is not the best use of the task owners' resources, and at the end of the day we are also not using our worker resources in the best possible way.

Today I want to talk about CrowdSynth, a system we are designing for managing crowdsourcing tasks. CrowdSynth has two main capabilities. The first is machine learning: it has a set of machine learning models for learning about tasks and workers, and for fusing machine analysis with the noisy worker reports we get through crowdsourcing. In addition, it has decision-theoretic planning capabilities for optimizing efficiency as we solve consensus for these crowdsourcing tasks. CrowdSynth is designed for a special subset of crowdsourcing tasks that we call consensus tasks. In a consensus task, the task owner has a question in mind. It does not know the correct answer to the question, but it can go to the crowdsourcing system.
The crowdsourcing system can hire multiple workers, and the idea is that if it asks sufficiently many people, the consensus of these workers will give us the correct answer to the question. In this system we pay money to, or spend time with, each worker, so hiring each worker is costly. We see examples of consensus tasks in many crowdsourcing applications: in games with a purpose, in paid crowdsourcing platforms like Mechanical Turk, and in citizen science applications like Galaxy Zoo.

However, solving consensus tasks effectively is not trivial. First of all, in addition to the noisy workers providing us input, we may also have some automated task analysis, for example a machine learning model that, in addition to these workers, provides hypotheses about what the correct answer is. So there is a question of how to fuse this automated analysis with human input to get to the correct answer as cheaply as possible. Also, as some other speakers pointed out today, not all workers and not all tasks in these crowdsourcing systems are equal: some workers are better than others, and some tasks are easier than others. So there is a question of how we aggregate these workers' reports to get to the answer quickly. Finally, each worker we hire provides some evidence, but each worker is costly. So there is a question, and I think somebody in the audience raised it earlier: how many of these worker reports are enough? When is it fine to say, I'm done, I'm hiring this many people and this is the answer I will give to the task owner?

CrowdSynth is designed to address these questions. It has access to automated machine analysis, but it also has access to workers in a crowdsourcing system. When a consensus task comes into CrowdSynth, the system can decide to hire workers, and in return it gets worker reports. At each step of the process it needs to make a decision between hiring an additional worker and paying more money, versus stopping, making a prediction about the correct answer, and giving it to the task owner.

CrowdSynth has multiple components. It has task and worker databases that store historical information about the tasks: what their correct answers and consensus answers have been, and what workers have said for these different tasks. It has a feature generation component that takes the automated task analysis, the sequence of worker reports collected for the task, and the historical information from the databases, and generates a set of features describing the current state of the consensus task. On top of these features, we need predictive models to make sense of them. We have two kinds of models: answer models, for predicting the correct answer of the consensus task at any point in time as more worker reports come in, and for assessing the confidence of the system in that answer; and vote models, for predicting the future execution, for saying, if I hire another worker, what will that worker tell me. These next-vote models let us predict the future and ask: what am I expecting to happen, and will the improvement be enough to compensate for the additional cost? They enable us to make that kind of analysis.
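A minimal sketch of that hire-or-stop loop, assuming hypothetical answer_model and hire_worker stand-ins; the thresholded stopping rule here is only a placeholder for the decision-theoretic planning described next:

```python
import random
from collections import Counter

LABELS = ["elliptical", "spiral", "merger"]

def answer_model(prior, votes):
    """Stand-in answer model: returns (predicted_label, confidence).
    CrowdSynth learns this from task, vote, and worker features."""
    if not votes:
        label = max(prior, key=prior.get)
        return label, prior[label]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def hire_worker(task_id):
    """Stand-in for posting the task and collecting one worker report."""
    return random.choice(LABELS)

def solve_consensus_task(task_id, prior, max_workers=50, confidence_target=0.9):
    """Hire workers one at a time until confident enough or out of budget."""
    votes = []
    while len(votes) < max_workers:
        label, confidence = answer_model(prior, votes)
        if confidence >= confidence_target:
            break                                # stop: prediction is good enough
        votes.append(hire_worker(task_id))       # hire: pay for one more report
    return answer_model(prior, votes)

prior = {"elliptical": 0.5, "spiral": 0.4, "merger": 0.1}
print(solve_consensus_task("galaxy-123", prior))
```

In this sketch the confidence is just a vote fraction; in the system described here the answer model is learned from task, vote, and worker features, and the stopping decision comes from a planner rather than a fixed threshold.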
And we have a decision-theoretic planner on top of these predictive models that makes these hiring decisions in a way that optimizes the final utility of the system.

We are studying the CrowdSynth ideas on a well-known crowdsourcing platform called Galaxy Zoo. Galaxy Zoo is a project introduced by the Zooniverse team. The Zooniverse team has been phenomenal at bringing a number of very influential citizen science applications to the Web, and they have energized a large community of citizen scientists. By citizen scientists I mean regular people around the world who have Internet connections, want to make a difference in science, and want to contribute to it. They have built platforms for engaging these people in scientific research, and Galaxy Zoo is one of their most successful applications. The goal of Galaxy Zoo is to ask regular people in the world, these citizen scientists, about the types of galaxies. When a citizen scientist comes to the platform, the Galaxy Zoo system shows an image of a galaxy. There are millions of these galaxies, but there are not enough experts in the world to classify them all. The citizen scientist looks at the picture and says: I think the correct classification of this galaxy is elliptical, or spiral, or merger. The idea is that if we can ask enough people about these classifications, we can classify the galaxies correctly, and they actually have experiments showing that the citizen scientists' classifications match expert opinion, so it has been very successful. Since the launch of the Galaxy Zoo project, hundreds of thousands of people have contributed to it. They have successfully classified millions of galaxies, and this has resulted in the discovery of new galaxy types and new astronomical objects, and has led to research papers that increase our knowledge about how galaxies evolve over time. We want to thank the Galaxy Zoo team for giving us access to this beautiful dataset for our studies. In this dataset there are more than 34 million worker votes collected for nearly 900,000 galaxies, from about 100,000 unique workers.

The first step we take is generating a set of features describing these consensus tasks. The first set is task features. These are the result of automated task analysis, which uses computer vision algorithms to look at the Galaxy Zoo images and gives us different features representing them, for example how much noise an image has or what the radius of the galaxy in the image is. These features come from the Sloan Digital Sky Survey, and we have 453 visual features like this; they represent the automated task analysis component. In addition to this, we define a set of vote features that take the votes collected for a task so far and represent what they look like: the distribution of votes, the number of votes for each class, the entropy of the distribution, the mode class, and many others. We also define a set of worker features that say, based on the historical data we have, what the usual time is that a worker spends on a task, how much experience they have, and what their accuracy has been for different kinds of tasks. And we define some vote-worker features, which combine the vote features with the worker features to generate better descriptions.
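A rough sketch of how such vote and worker features could be computed; the feature names and the shape of the history records here are illustrative assumptions, not the actual CrowdSynth feature set:

```python
import math
from collections import Counter

CLASSES = ["elliptical", "spiral", "merger"]

def vote_features(votes):
    """Summarize the votes collected so far for one task."""
    counts = Counter(votes)
    total = len(votes)
    dist = {c: counts[c] / total if total else 0.0 for c in CLASSES}
    entropy = -sum(p * math.log(p, 2) for p in dist.values() if p > 0)
    mode_class = max(dist, key=dist.get) if total else None
    return {"num_votes": total, "entropy": entropy, "mode_class": mode_class,
            **{f"frac_{c}": dist[c] for c in CLASSES}}

def worker_features(worker_id, history):
    """Summarize a worker from historical records:
    history[worker_id] is a list of (time_spent, was_correct) pairs."""
    records = history.get(worker_id, [])
    n = len(records)
    avg_time = sum(t for t, _ in records) / n if n else 0.0
    accuracy = sum(1 for _, ok in records if ok) / n if n else 0.0
    return {"experience": n, "avg_time": avg_time, "accuracy": accuracy}

print(vote_features(["spiral", "spiral", "elliptical"]))
print(worker_features("w42", {"w42": [(6.1, True), (4.8, True), (7.3, False)]}))
```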
So now that we know how we do the feature generation, I will talk about building the predictive models. In consensus tasks there are a number of different uncertainties. For example, the system is always uncertain about what the correct answer is, and it is also uncertain about what the next worker would say if it hired an additional one. So we build predictive models to represent these uncertainties: answer models predict the correct answer, and vote models predict the next vote. The figure at the top shows a typical model we would build for predicting the answer and the next vote, and as you can see these models bring the different features together to make predictions. We use supervised learning with Bayesian model selection to train these models.

For predicting the correct answer, we took two general approaches. The first is building a generative model that makes an inference based on a prior answer model, which uses only the automated task analysis features, and a vote model, which predicts the next vote based on the correct answer, the task features, and the vote features. We implemented two well-known generative approaches: a Bayesian model that makes an independence assumption about the votes coming in for a task, and an iterative model that does not make this assumption. We also tried a discriminative approach, which learns a direct model from all the features available for a task. In this graph you can see the accuracies of these models as the number of votes increases from zero to 60. One important lesson is that as the number of votes increases, we see a benefit from using a discriminative approach, because the votes come together in synergistic ways and the discriminative model is able to represent this. In addition, these consensus tasks are really difficult: we may need a very large number of worker reports to be able to predict the correct answer accurately. In this figure we can see two discriminative models, one trained for the case where only a few votes are available and a second one trained when a large number of votes is available. As you can see, when only a few votes are available, the task features play an important role in predicting the answer and the next vote, but as the number of votes increases, the vote features become the best predictors of the correct answer.

Now that I have described how these answer and vote models are trained, I want to talk about how we make decisions using them. The goal of the decision-making component is to optimize hiring decisions so as to maximize the expected utility of solving these consensus tasks. Here we define the utility as the final accuracy of the answers produced by the system minus the payments we give to workers. We don't want to hire a lot of workers, because they are costly; we can't hire infinitely many people, but at the same time we want to reach good accuracy, and the system needs to make a good trade-off for different cost values. We model consensus tasks as a finite-horizon Markov decision process with partial observability. The state includes the task features and the votes. At each step the system keeps a belief about what the correct answer is. The system has two high-level actions: the first is hiring a random worker and continuing the task; the second is not hiring a worker and terminating the task.
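A compact sketch of that formulation, with a toy Bayesian belief over the answer standing in for the learned answer model, a fixed worker-accuracy parameter standing in for the learned next-vote model, and a one-step (myopic) lookahead standing in for the full planner; the cost and accuracy numbers are assumptions for illustration:

```python
from dataclasses import dataclass, field

LABELS = ["elliptical", "spiral", "merger"]
WORKER_COST = 0.01   # assumed per-label cost; the talk sweeps this from 1 down to 0.01 cents

@dataclass
class ConsensusState:
    belief: dict                      # P(correct answer = label | evidence so far)
    votes: list = field(default_factory=list)

def next_vote_model(state, accuracy=0.7):
    """Stand-in next-vote model: a worker reports the true label with some accuracy."""
    return {v: sum(state.belief[a] * (accuracy if v == a else (1 - accuracy) / 2)
                   for a in LABELS)
            for v in LABELS}

def update_belief(state, vote, accuracy=0.7):
    """Bayesian update of the answer belief after observing one vote."""
    unnorm = {a: state.belief[a] * (accuracy if vote == a else (1 - accuracy) / 2)
              for a in LABELS}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

def terminate_value(state):
    """Utility of stopping now: probability that the predicted answer is correct."""
    return max(state.belief.values())

def hire_value(state, cost=WORKER_COST):
    """Myopic (one-step) value of hiring: expected post-vote confidence minus cost.
    CrowdSynth plans many steps ahead; this one-step lookahead is only a sketch."""
    return sum(p * terminate_value(ConsensusState(update_belief(state, v)))
               for v, p in next_vote_model(state).items()) - cost

state = ConsensusState(belief={"elliptical": 0.55, "spiral": 0.35, "merger": 0.10})
action = "hire" if hire_value(state) > terminate_value(state) else "terminate"
print(action, round(terminate_value(state), 3), round(hire_value(state), 3))
```

The quantity being traded off is the one stated above: confidence in the final answer minus the payments made to workers.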
The reward for taking the hire action depends on the cost of a worker, the reward for not hiring any more workers depends on the confidence of the system in its answer prediction, and the stochastic transitions of the model are determined by the next-vote models. So, as you can see, the predictive models become inputs to our decision-theoretic planner, and the planner decides whether or not to hire new workers. This decision about which action to take is guided by a value-of-information analysis: for each state, the system asks what its value is for taking the hire action versus the not-hire action, and it should take the action that provides more expected value.

However, just using regular MDP or POMDP approaches to solve these tasks is not possible, because the horizon of these consensus tasks can be very large, and we know that when there is partial observability, the complexity of solving these models grows exponentially with the horizon, which makes exact solution approaches infeasible. For some tasks we have up to 90 votes, so we would need a horizon of 90, which is not feasible. So what is our solution for solving these tasks efficiently? The solution is to exploit their special structure, because consensus tasks have a special structure: there are two high-level actions, hire a worker or do not hire; the task terminates when the not-hire action is taken; each hire action adds new evidence; and answer predictions are most accurate at the horizon, when all votes have been collected. By exploiting this special structure, we designed a new Monte Carlo planning algorithm that can solve consensus tasks efficiently.

Here is what the algorithm does. It starts from a state s_i, the current state of the consensus task, and generates samples. Each sample represents one possible future execution of the system if the system continues to hire as many workers as possible until reaching the horizon. Starting from state s_i, we say, okay, I'm hiring a worker; let's sample what that worker would say, maybe elliptical or spiral. We update the state, take another hire action, sample another worker vote, update the state, and so on until we reach the horizon. At the horizon we have a lot of evidence about the task, so from the belief at that state we can sample a correct answer, which we call A-bar. Now we use this sampled correct answer to backpropagate and evaluate the value of the previous states for terminating at any point. We backpropagate A-bar to each state s_t on the sample and ask: if you terminated at that state, if you took the not-hire action, what would you say the correct answer is, and does it agree with the sampled correct answer? If it agrees, that state gets a good reward; if it doesn't, it gets a bad reward. We do this for all states encountered on the sample, so at the end we can assign a utility to each of the states we have seen on it. After we generate many samples like this, we bring them together in a partial search tree like this one, where the branches represent the stochastic votes we can get from workers. Starting from the bottom of the tree, we can evaluate, for each state on the partial search tree, the value of hiring workers versus not hiring workers, and make decisions accordingly. So what are the properties of this algorithm? It is an anytime algorithm.
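A compressed sketch of that sampling scheme: roll a sample forward to the horizon by drawing votes, sample a correct answer A-bar from the terminal belief, then sweep back over the visited states to score terminating at each point. To stay short, this sketch pools states by depth rather than keeping the partial search tree, and it reuses the toy belief and vote models from the previous sketch, so it illustrates the structure of the idea rather than reproducing the exact algorithm:

```python
import random
from collections import defaultdict

LABELS = ["elliptical", "spiral", "merger"]
ACC, COST, HORIZON, N_SAMPLES = 0.7, 0.01, 20, 2000

def update(belief, vote):
    """Bayesian belief update after one vote, with assumed worker accuracy ACC."""
    unnorm = {a: belief[a] * (ACC if vote == a else (1 - ACC) / 2) for a in LABELS}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

def sample_vote(belief):
    """Draw a worker vote from the next-vote distribution implied by the belief."""
    probs = [sum(belief[a] * (ACC if v == a else (1 - ACC) / 2) for a in LABELS)
             for v in LABELS]
    return random.choices(LABELS, weights=probs)[0]

def plan(initial_belief):
    # value_sum[t] accumulates, over samples, the reward for terminating after t votes
    value_sum, visits = defaultdict(float), defaultdict(int)
    for _ in range(N_SAMPLES):
        beliefs = [dict(initial_belief)]
        for _ in range(HORIZON):                       # keep hiring until the horizon
            beliefs.append(update(beliefs[-1], sample_vote(beliefs[-1])))
        a_bar = random.choices(LABELS,                  # sample a "correct" answer A-bar
                               weights=[beliefs[-1][a] for a in LABELS])[0]
        for t, b in enumerate(beliefs):                # backpropagate the sampled answer
            predicted = max(b, key=b.get)
            value_sum[t] += (1.0 if predicted == a_bar else 0.0) - COST * t
            visits[t] += 1
    values = {t: value_sum[t] / visits[t] for t in visits}
    best_later = max(values[t] for t in values if t > 0)
    return ("hire" if best_later > values[0] else "terminate"), values

action, values = plan({"elliptical": 0.55, "spiral": 0.35, "merger": 0.10})
print(action, round(values[0], 3), round(max(values.values()), 3))
```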
With each sample we can explore states at different time steps, which is a good property when the horizon is so long, and a single sample can evaluate the value of multiple action sequences. By using these properties we can solve consensus tasks efficiently.

Now I will show you results from an empirical evaluation with the Galaxy Zoo data. We compare CrowdSynth to two baselines: a no-hire baseline, which hires no workers and just uses the automated task analysis, and a hire-all baseline, which is the original Galaxy Zoo system. We also try different CrowdSynth variations: a limited-lookahead approach, which looks 16 steps ahead; UCT, an upper-confidence-bound Monte Carlo tree search algorithm; and our novel sampling-based algorithm. In this graph you can see the utility of the system as the cost of hiring a worker for a label decreases from 1 to 0.01 cents. As you can see, when the cost of a worker is not very high, our algorithm and CrowdSynth do much better than all the baselines and all the other Monte Carlo planning algorithms. Why do we see this? Because our system can successfully trade off the expected benefit of a worker against the cost. The blue line is the accuracy of CrowdSynth, and the red bars represent the percentage of votes collected. As you can see, when the cost is really high, the system does not hire many workers, but it can still increase the accuracy above the baseline. As the cost decreases, it increases the number of workers it hires, and gradually the accuracy increases. Finally, when the cost is really low, CrowdSynth reaches the maximum accuracy of the original system while hiring only 47 percent of the workers. We did a similar analysis for the case where there is no per-worker cost but there is a fixed budget, and we showed that CrowdSynth does much better than the other baselines.

I know I'm out of time, so I will just conclude with the future work, or current work, slide. What are we working on now? This version of CrowdSynth works when there is a large set of historical data; we are interested in how we can use the same ideas when such a dataset is not available, so we are interested in the design of a system that can learn and act simultaneously. We want to go beyond consensus tasks; for example, the picture on the right is another example from a Zooniverse platform where it is no longer a consensus task but a discovery task. We are also interested in other prediction problems, like predicting the engagement of workers. And we have an ongoing collaboration with the Zooniverse team; they are very interested in trying these ideas in their platform, so you will see these ideas come true in their systems. So thank you very much, and I'd like to take your questions.

[applause]

>>: It sounds like your system is really beneficial if the task is hard and you really need multiple judgments, and it also assumes sort of one judgment at a time: you've collected three workers, three judgments, and decide whether to go for the fourth or fifth one. In practice you usually cannot really get only one, because you have the problem that you have to keep returning to it. So you have to [inaudible]

>> Ece Kamar: How do you need to get --

>>: Right. Because you have to keep looking -- you have to keep filling the batch.
I guess you can change it -- the question is what would change if you have either a very small budget, so we really cannot get that many labels, and also what would change if we cannot request one label at a time but have to request, say, five labels or three labels.

>> Ece Kamar: The ideas would be pretty much the same. Instead of being able to hire 50 workers, you can hire five. Your decision-making problem actually gets much simpler, because you only need to think about five steps ahead at most, so you don't have many of the challenges and problems we've seen. The task becomes simpler when you can hire at most five workers. And as you have seen here, even for the Galaxy Zoo task, there were times when the system was confident enough to say it was done after hiring one worker: the automated task analysis was already confident that the galaxy was elliptical, the first worker the system hired said elliptical, and the system said, with these costs I don't think I can improve much by paying more workers, so I'm stopping now after one worker. So even when the number of workers you can hire is not that large, you can still generate benefits by using an approach like this. And coming back to the question about batches: such a system is more beneficial when you can dynamically adjust what you want to do after each vote, because you have more freedom about the granularity of the workers you hire. The same idea would still work if instead of hiring one worker you hire a couple of them, but in the end you may not get all the benefits you would like out of the system, just because your decisions are not that granular.

>>: It looks like the action space you're considering is hire or not hire, and that's fine. But in practice I often have a fixed budget and I do the best I can. Have you looked at instead trying to determine which tasks to give more workers, a crowdsourcing active-learning type of approach? It looks like similar techniques could work for that, but it didn't seem to be in scope here.

>> Ece Kamar: Here, because of the limitations of the dataset and the Galaxy Zoo platform, we can't really target individual workers, which is why we performed our simulations like this. But in separate simulations that I haven't presented today, we built next-vote models that were personalized, so we were actually predicting what a particular worker would say for a particular galaxy, and we then customized our decision-making model by also deciding which worker to hire among, say, thirty available workers. And you can actually see benefits in that.

>>: You would hire whichever is closer, but what you could do is not about which worker; it's about which tasks, which actions, which questions do I give you.

>> Ece Kamar: That is the analogous problem, looking at it from the task side, and the same ideas would be applicable for that kind of analysis, too.

>> Ofer Dekel: Let's thank the speaker.

[applause]