>> Ece Kamar: Hi, everyone. It's my pleasure to introduce Walter Lasecki. Walter is a PhD student at the University of Rochester. He is working with Jeff Bigham and James Allen. He is a second-time intern at MSR. He is also a Microsoft Research fellow. Walter is mainly working on interactive systems that are powered by the crowd and machine intelligence, and he is going to tell us today about those systems and the work he has been doing with us here at MSR. >> Walter Lasecki: Awesome. Thanks, Ece. So yes, today I'm going to be talking about how you can use crowdsourcing to actually enable robust, interactive, intelligent systems that are versatile enough to be put in the wild with real users and see what happens. So this is work in collaboration with both Jeff Bigham and James Allen. I'm going to start off today with a little bit of background on human computation; then, I'm going to go through some of the existing work in crowdsourcing, something that's kind of laid the foundations for what we're doing these days; and then, I'm going to look at how we can use this crowd agents model that we propose to enable truly deployable intelligent systems; and then, a little bit about future work on that front. So human computation is in many ways a field revisited. In his book "When Computers Were Human," David Alan Grier kind of outlines the history of computers not as machines necessarily but kind of the role of a computer or the work of a computer back when it was someone who did computations. And it's a very interesting history because of course this role goes way beyond machines. So eventually people were replaced by machines because for raw calculations, for simple math, machines just really can't be beat by people. But there are still things that humans can do that computers can't yet. So we're seeing a resurgence in using people in a computational process. Now more recently you get a lot of human computation kind of in the form of crowdsourcing, but even crowdsourcing isn't new. Distributing work to a large group of people is what the Works Progress Administration did in the 1930s to compute large scientific tables. And they had workflow processes that were focused around distributing the types of tasks that each worker was doing in a way that prevented similar errors so that you could recombine everybody's answers and get a more reliable result. So some of the differences between that and what you hear these days is, well, crowdsourcing is now an open call. It might be on the web. It might be to your user base. It might be to just anyone who wants to come along and help. But in this case we're often warned about these malicious users, especially in kind of semi-anonymous online work; you don't have any guarantees that the person on the other end will actually do your task, will do it correctly or has any expertise to do it. So you get quotes like this: workers "are inherently lazy and will avoid work if they can. Workers need to be closely supervised and a comprehensive system of controls developed. A hierarchical structure is needed with a narrow span of control at each and every level." Workers "will show little ambition without an enticing incentive program and will avoid responsibility whenever they can." This is a very pessimistic view of workers, but the interesting part is that this is not about web workers. This is actually from 1960 and it's talking about regular full-time employees. McGregor was at MIT Sloan, I believe, when he wrote this book.
But Theory X, the way he saw it, was not necessarily the full answer. And I'll get back to this before the end of the talk. But it's just kind of an interesting idea that we've already looked at workflows in this same light; we've already looked at workers as having needed these tight controls and incentive mechanisms. So what's different now? What do we know now that we didn't know 30 or 40 years ago? Are people smarter? Did they suddenly overtake computers in their ability? That's not it. Computers can still do what they could before, only much faster. In fact if anything, the gap has closed between computers and humans. But what we do have is more technology around us, more computer systems, more distributed systems that all can communicate with each other and the user in a way that lets us integrate people and the way people work in a different way. So if you think about a traditional workflow model or a traditional hiring model: you hire one person and if you want to make your system more intelligent, if you want to answer queries with human intelligence, then maybe you have that person sitting at a desk just waiting for you to ask a question. And they wait there all day and maybe if you hold odd hours, you have to have multiple shifts of people just being paid to sit and wait. Of course that doesn't really work, but now with online labor markets such as Mechanical Turk, you can hire a person to do a very short task and there is someone always available on demand. In fact you can even get a lot of people on demand and parallelize tasks that would have been impractical to parallelize before because you would have needed to hire all of these employees full-time. Now there are differences here, of course. If you hire a full-time employee, you have some guarantee, at least some expectation, that they're going to be around. But unlike traditional employees, crowd workers might come and go as they please. You never know exactly who's available, and often you don't learn enough about any individual worker to really have a guarantee on their expertise. But you can at least get these workers fast. And you see work like Adrenaline and QuikTurkit that have looked at actually being able to bring in a new person quickly, but you don't know anything necessarily about that person. So the traditional fix to this is to take your task and break it up into a lot of little different pieces that people can complete quickly. And as long as they're completing it quickly then you don't have to worry about this high chance that they'll leave before the end, before the task is accomplished. But if I distribute this to a group of workers then I don't want three answers to come back most of the time. Most of the time I want a single answer for my question. So you have to start looking at reasonable ways to combine input from multiple workers, which can give you more confidence. Sometimes you might average, sometimes you might take a vote and select an answer, but either way you're getting more confidence than you would have with a single worker by using multiple people. There are of course different ways to do this -- once a task is complete you can put it back and select a new one. So for some tasks, instead of looking at completely parallel workflows -- where everybody gets the same piece of information, they complete it, and then they all compare answers in the end essentially, or the system compares answers for them -- we can start to look at systems where there's an iterative process.
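[Editor's note: before the iterative workflows discussed next, here is a minimal sketch of the parallel combine step just described -- majority voting for discrete answers and averaging for numeric ones. The function names and data formats are illustrative assumptions, not taken from any of the systems in this talk.]

from collections import Counter
from statistics import mean

def combine_votes(answers):
    # Majority vote over discrete worker answers; also report what fraction agreed.
    counts = Counter(answers)
    best, support = counts.most_common(1)[0]
    return best, support / len(answers)

def combine_numeric(answers):
    # Average numeric worker answers, e.g. an estimated count or price.
    return mean(answers)

# Three workers label the same image; two agree, so their answer wins.
print(combine_votes(["cat", "cat", "dog"]))   # ('cat', 0.666...)
print(combine_numeric([4.0, 5.0, 4.5]))       # 4.5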
You get slightly better answers as you take a task and pass it from one worker to the next, to the next, and eventually you converge on an answer and you can mark it as done and move on to the next piece. Now this takes a little bit longer but you might be able to get more expertise in a lot of cases when you have people building on each other's solutions. So in addition to just flat models like that you start to get more and more complex models. Soylent used the Find-Fix-Verify paradigm to have one group of workers find the part of the task that needs work and pull it out, then have a different set of workers actually identify what a solution to that problem is, and finally pass it along and have a different group of workers verify it, because one of the problems with an iterative model is that at any point, if I'm passing it along from one person to the next, one of the workers could simply delete the content or make a change that makes it worse and makes it harder for other workers to complete their task. But this kind of verification ensures that at least within each loop you have some improvement. So now that we have the crowd and we can even get reasonable responses back from them, what do we do with it? Right? Great, we can bring human intelligence to bear on problems, but what types of problems are most interesting? And for this we look at artificial intelligence because in many ways they have the exact same goals, and they've kind of dreamt up what they want but maybe don't have the final answer. So there's a lot of overlap in interesting problems here. So just as one example, conversational assistants have been a goal for a very long time. So ELIZA was from the late '60s, I believe. The problem with systems like this was that they were very, very constrained. So you had a template of what kind of interaction the system expected and if you didn't get that then kind of all bets were off. If I start asking ELIZA about booking a hotel then all that happens is we end up getting into a very shallow conversation about who was messing up the conversation first. And even if you fast forward to today, we have much better attempts at conversational interaction with systems like Siri or, if you're a little less concerned about portability, Watson. But in both of these cases they're still looking at very specific formats for the conversation. They're still looking at very specific use cases if nothing else. But what we want is conversational assistants who can actually help us. We want whole intelligent systems that can not only come up with what our solution is but think about our intents, think about why we're asking this question and maybe even answer a question that we haven't asked yet. So things like conversational systems are something that can make people's lives a lot more convenient, but there are roles where this type of intelligent interaction can actually be much more powerful, can actually have a transformative effect on the daily life of users.
Jeff Bigham and I really focused on assistive technology for this reason, because being able to provide a blind user with a way to interact with their desktop through conversation alone, or being able to help a motor-impaired user use a predictive keyboard that actually has a better sense of what you're typing, activity recognition for cognitively impaired and older users who might need a little assistance and prompting throughout their day to safely live unaccompanied, to things like captioning and transcription services for classroom settings where students who are deaf need to be able to follow along with the lecture but it's very expensive right now to provide this using a single expert human, and even to things like visual assistance where a blind user can query their environment and find out more about their surroundings. So the takeaway from this talk, I really hope, is that we can empower intelligent interfaces, this kind of broader goal. But most of the examples that I will show you apply these intelligent interfaces to specific assistive technology problems. Remember, this is the basic model we have. Take a task, break it into pieces, send the pieces to people. But how do you control an interface? How would I break this task up into a lot of little pieces that individual workers can control? And this problem is actually what got me into the crowdsourcing space, because I looked at it and I said, "Well, I mean, I understand human computation. You take a problem that your algorithm doesn't quite solve, you put some people in it and then, you get yourself a mission accomplished banner and you're done." But it doesn't seem to quite work for this interface. I could break the task up by button, let's say. I could have everybody controlling an individual button, but that doesn't really get at the heart of the problem. We don't have any consistency between controls. Imagine this is controlling a small, commercial, off-the-shelf robot. You can see one of the options here is to extend the camera. And that's great, but if somebody chooses to do that to look over a barrier while someone else wants to drive under it, you'll break off the poor camera. So we want consistency between actions. We want like the whole of what the system is doing to be self-consistent. But backing out, if we do things like breaking it down by frame, and I have workers look at the current situation and they say, maybe, that I need to drive forward, then we ask a whole different set of workers, "Okay, now what do we do next?" Again, we have no consistency now across time instead of across the actions that are selected. So the difference between this and a normal task is that it is continuous. We have a person who is interacting with the feedback from a system, and they're getting this feedback as kind of constant input. And what they want to do -- and if this is just a typical end user -- they'll take off a piece of that task -- now maybe in the interface control it's knowing when to press a button or knowing how to react to some situation -- and they'll perform a task and provide input back to the system. Now another problem is that even that model is not as simple as it sounds, because if workers aren't able to keep taking new tasks, the environment is not going to slow down. The environment is not going to change its pace just because the way we chose to control the system has to wait for a set of workers to come back.
So if this worker holds it too long, all of these tasks are going to back up and we're going to end up with incomplete sessions, or something is going to be missed in the environment. Now in general if we want to crowdsource this, we can look at just parallelizing the input. So it's the same type of task division, only now we're doing it with continuous streams of tasks. So the task comes in, each worker takes off a piece; but then what? We can't give the system three inputs at once. We can't even actually know that the workers all grabbed the same piece of the task. Maybe they saw different problems. It's really this combination step that gets tricky. And at a high level what we want to do is just merge them back together, but what that looks like is very domain dependent. So we set about trying to fix this by abstracting away the crowd entirely. So what if we could look at the collective intelligence itself as an individual entity? We call this a crowd agent. So internally you have some specific -- there are two mediators: a task mediator and an input mediator. The task mediator handles whatever breaking up a problem in your domain looks like and passes that to workers, and then the input mediator takes the workers' output and combines it all back together. But of course both the task and input mediators are going to be very domain specific. So we have to start thinking about what do these look like? How can you generalize this model? But going back -- I'll say real quickly, by presenting as a single user, if you can behave like a single user, if you can take input continuously and provide output as a single stream, you can see the inputs and outputs are basically identical to what you would have if it was really just an end user. So we don't have to change the environment; we don't have to change the task that's incoming in any way. We can start to communicate with existing computer systems or existing users in a very natural way, and we just have to process this all internally. So looking back at the example of the robot: this is kind of one of the first pieces of work we did. And I'll have the citations along the bottom there in case anybody wants to kind of follow up. I'll be happy to answer any questions but that might be a nice resource. So what we're looking at is even more general than just controlling a robot. How can you control the interface that controls the robot? How can you actually take any UI that's out there and control it with the crowd? What we came up with was Legion, and Legion allows you to select a piece of your interface right here, just drag a box around it and then, you'll set some parameters like price and a small description for the workers and you'll set what keys they're allowed to use and what mouse input they're allowed to use. The system automatically starts to recruit workers from Mechanical Turk. And when they arrive at the page they see this little bit of information you've given them, so they see a quick description, some of the keys that they are able to use and then, they're given what is essentially a remote desktop connection to the system. However, of course, we mediate their input so that one worker can't just start controlling someone else's computer ad hoc. But the way we figure out how to combine this and how we compare people to find out if they're doing a good job is actually to use the group not as a set of input votes. We're not looking at how many people just pressed up; "Well, it is a majority so I should go up."
Instead, we use it to select a leader. And then, we give the leader short periods of control. What this does is essentially allow people to overcome small decision points where you get to a point where you could make two equally good choices. You can't get agreement in the crowd because there's no communication in this case between the workers; we just want to see what people would do. But we allow whoever is in control to kind of get around that point and then see if people start to agree with them again. So as we go, each worker is providing some input. And you can see that we switch from yellow to blue here. We always ignore the red malicious worker at the bottom because there's just not enough agreement. The way we calculate this though is actually to take the input from a single worker and compare it to the aggregate. So we look at every input that was given over that time span, which is usually about half a second to two seconds, look at all the input and we say, "Which part of that was theirs?" And we can compare this by looking at the vector cosine, which is essentially the projection of the vector of input from the single user onto the crowd's answer. And we assume that the crowd is correct. Yeah? >>: So you have to get this whole bunch before you actually decide who's [inaudible]? >> Walter Lasecki: Yes. But what we do is, it's essentially the vote for the next period. So we go through and everybody provided input with one person in control. Right? And as soon as you hit the next piece -- We use the voting data. >>: Right. So if I say I want the robot to go right, up, up, left, I have to do all that in my head without actually seeing... >> Walter Lasecki: No, no, no. So we're letting one person control the specific instance but we're looking at how much they agreed with other people who were concurrently controlling to select who the next leader is going to be for the next one-second span. So you're being rated based on your historic performance and then we pick the best representative from the crowd going forward. But that's a good point, thanks. And using this we can do more than just kind of de-noise the crowd. We also looked at this kind of fun application of letting people play single-player video games as multiplayer. And it was kind of interesting -- I think the reason I'm interested in this work is really that we can start to use multiple Legion systems concurrently. So we actually are averaging across different subsets of the controller here. So we don't want everyone to have to agree on all of the actions. If you want to control a piece of the system, maybe you just want to work on like special abilities and jumping or whatever in a video game, you can work on just that piece. You don't have to participate in the larger task. You can specialize or you can collaborate with others. And if multiple people are trying to control the same subsystem then you can get the averaging. Yeah? >>: So is it confusing for the workers to give input and kind of see that their input is not followed? >> Walter Lasecki: Yeah, so... >>: Because if the person says, "I want us to go left" then I see it going right and I kind of need to make a decision again. >> Walter Lasecki: Right. >>: [Inaudible]... >> Walter Lasecki: Yeah, so you kind of have to adapt to... >>: ...[inaudible]. >> Walter Lasecki: Yeah. >>: Then how does that... >> Walter Lasecki: Mediated? >>: ...interaction go?
>> Walter Lasecki: So it turns out that in a lot of these cases, if you have reasonably few decision points -- Well, so basically a decision point is any point where you could have equally good options. If you have a reasonably small set of those throughout the course of the plan then you'll diverge very little from the crowd if you're trying to do the right thing. So if we're driving to the teapot here, they can avoid the obstacle in more or less one way. Now there is a little bit of leeway in there, but if you start to disagree too much, you have a radically different plan from the rest of the group or you might just not be giving a valid answer -- So the short answer is you actually don't always notice that it's not listening to you except for in select cases. And when you do, you can see that basically it acknowledges your input. It's not that the system didn't hear it. You see that, yes, you pressed forward. But it does tell you; the workers are informed that they're working as part of a group. So right now it's listening to the crowd. "We heard your up but we're going to listen to the crowd for this moment." And because you can't take too large a step away from what you were just doing, the replanning process for people is very simple. So if it takes a slight right there is not a huge cognitive difference there. >>: But the advantage of being [inaudible] some simple [inaudible], you want to have some kind of continuous decision making coming from the same person or is it [inaudible]... >> Walter Lasecki: So part of it is that we want a continuous decision process throughout. And what you get is, because we're ranking similarity, the people at the top all tend to agree with each other. So even if you switch who is in control for that small moment, the general plan is usually the same. But the real big advantage is, yeah, you get to make these decisions online. If you stop and vote first, it greatly slows down the system. If you have to actually react to things in the environment, it slows it down. And what we were able to show, actually -- and I don't have the chart on here -- is that compared to other simpler input mediation strategies this actually is not only faster but more reliable. So if you just gave it to a single person, the average time tends to be a little bit faster, like 10 to 20 percent faster than if we did it with the crowd using this mechanism, but they only complete the task 40 percent of the time. And in our trials I think we completed 100 percent of the time. So you get kind of the combination of speed and reliability. Other approaches that were similarly reliable often took two to three times as long to even complete a simple task. Did you have a --? All right. So it can be applied to other interesting domains. But one thing that Legion kind of relied on is the fact that all of the workers could concurrently complete the same task, and do so in a somewhat consistent way. But what if your workers are not actually able to do any of it -- are actually unable to do the task itself? And we see this problem in captioning. Professional captionists go through years of training in classes to be able to type 300 words per minute, which is actually physically impossible on a regular QWERTY keyboard. So they also have to train to use these chording keyboards, the stenotype machines. I don't know if you've actually ever seen one of those. So we're not going to find any of these people on Mechanical Turk.
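[Editor's note: a minimal sketch of the input-mediation step described above -- over each short window, each worker's input is compared to the crowd's aggregate by cosine similarity, and the most agreeing worker becomes the leader for the next window. The key set, window format and aggregation here are illustrative assumptions, not Legion's actual implementation.]

import math

KEYS = ["up", "down", "left", "right"]

def to_vector(key_presses):
    # Count how often each key was pressed during the window.
    return [key_presses.count(k) for k in KEYS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pick_leader(window_input):
    # window_input: {worker_id: list of keys pressed this window}.
    # The crowd's "answer" is the element-wise sum of everyone's input;
    # the worker whose input best aligns with it leads the next window.
    vectors = {w: to_vector(keys) for w, keys in window_input.items()}
    aggregate = [sum(v[i] for v in vectors.values()) for i in range(len(KEYS))]
    return max(vectors, key=lambda w: cosine(vectors[w], aggregate))

# Two workers drive toward the goal; a third mashes 'down' and is never picked.
window = {"w1": ["up", "up", "left"], "w2": ["up", "left"], "w3": ["down", "down"]}
print(pick_leader(window))  # 'w1'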
We certainly are not going to find five of them. And if you consider that the average stenographer charges something like 100 to 300 dollars per hour for their services, we don't really want to find five to run concurrently. But using the same model where we have the task broken up and then sent to many different workers, we can actually just have workers type what they can. And we can divide up the task or just kind of have people self-select from the task. It turns out the dividing is a little bit more consistent because you get natural saliency adjustments throughout the stream. So I, for example, might always caption the first word, and everybody in this room would too, but we might miss the tenth word after that. So we can divide up the task. We can ask people to type however fast they can; 40 to 60 words per minute is plenty. And then, by using multiple workers and an alignment process that stitches everything back together, we can actually still cover the complete stream. So we can get this full answer, and what we do is use multiple sequence alignment. There is a separate paper here that looked at how we can use an A-star search to actually improve this to the point where we can run it in under a second and get some error bounds on the correctness of the alignment itself. And this works very well. And if you think that what we are looking at here is maybe three to five workers able to cover three hundred words per minute, then -- excuse me -- if I'm paying students that can type that fast or Mechanical Turk workers a reasonable wage, maybe 10 dollars an hour, then we're at 50 dollars or less per hour to get the same service that we would've been paying 150 for before. And this also makes the problem approachable by more workers. So I couldn't contribute to professional captioning even if I really wanted to, even if I would charge less. No matter what I did, I couldn't do it. But by lowering the barrier to entry using this system, what we're able to do is say that we have a much bigger worker pool and we can make this service available much more on demand. What you end up getting if you go into a university, for example, and you ask them, "What would it take, in terms of accommodations, for a deaf or hard of hearing student to just get captions in the classroom?" -- it turns out you have to give at least 24 to 48 hours' notice that you need this, and you have to be on file in advance with the disability service office just to be able to get the support. And it's because they can't schedule anyone any faster. They're not always available and they might be overbooked. So we can actually make this service not only cheaper but more available by using nonexpert workers in this way. So kind of along the same lines, looking at tasks that individuals cannot complete themselves: we have activity recognition. The problem here is not that people can't identify the activities, just like they could hear the words in Scribe; the problem is that they can't input the information to the system fast enough. So if I want labels for everything that's going on in this scene right now, most of us would be pretty hard pressed to be able to type that as it happens and as they switch from one action to the next, especially if you went down to the action level, not just the activity level. So what Legion AR did was use the same type of process to run a lot of these -- You take the same stream. We color code it so that workers can focus on an individual. And then, you type as many of the actions as possible.
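[Editor's note: stepping back to Scribe's merging step for a moment -- the real system stitches partial captions together with multiple sequence alignment, sped up by an A-star search; as a much-simplified stand-in, the sketch below merges timestamped partial captions and drops near-duplicate words. The timestamped-word format and function names are assumptions for illustration only.]

def merge_captions(partial_captions, dedup_window=1.0):
    # partial_captions: one list per worker of (time_in_seconds, word) pairs.
    # Sort everything by time and drop a word when the same word was already
    # kept within dedup_window seconds (two workers typing the same spoken word).
    words = sorted((t, w.lower()) for cap in partial_captions for t, w in cap)
    merged = []
    for t, w in words:
        if merged and merged[-1][1] == w and t - merged[-1][0] <= dedup_window:
            continue
        merged.append((t, w))
    return " ".join(w for _, w in merged)

# Two workers each caught part of the sentence; together they cover it.
worker_a = [(0.0, "the"), (0.4, "quick"), (1.2, "fox")]
worker_b = [(0.1, "the"), (0.8, "brown"), (1.3, "fox")]
print(merge_captions([worker_a, worker_b]))  # "the quick brown fox"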
Now this is a lot more general. We don't divide up the task in any specific way according to time. We just let you watch the stream, and you type as many tags as possible. What workers will see, kind of on the side of the video, will be labels: they see what other people have typed previously and what the system's guess is, because we're actually starting to get to the point where we're integrating existing machine learning techniques here in the form of a hidden Markov model-based activity recognition system. And that's going to not only allow us to try to take a guess with the system and then have it confirmed by people, but it's actually going to let us learn over time. So the crowd is no longer providing its own stand-alone service; it's providing a training service. We don't want to be monitoring people forever using just the crowd. What we'd like is when something new happens we can bring the crowd in on demand. They can cope perfectly fine. They know what payment looks like as this person reaches for their wallet, or what leaving the room looks like, or whatever; they can identify this in all the different forms it comes in when the system can't. So they give an answer. The system learns, "Okay, now I've seen an instance of that." And we're starting to work on systems that actually allow the crowd to give us a little bit more understanding about the same problem, where we don't just want to know what the label is but we want to know what it means to have performed that action in a given order. So, was it necessary to do one thing before another? And we start to get dependency relations between all this. And again we can use that to train an existing system, in that case, a lot faster. So the system -- Sorry, go ahead. >>: [Inaudible]. Does the crowd tend to converge on a granularity of actions? >> Walter Lasecki: Yes. Although, what we were doing is we were trying to focus them on a granularity, so we actually provide like two or three examples that might not be exactly in the space. So for monitoring this room we might give kitchen labeling examples just because that's what we thought of off the top of our head. But people are able to kind of pick up that level -- "making breakfast" instead of "cooking eggs in a frying pan" or "lifting pan" or something like that -- and they can map that on to the new domain. So we prompt them a little bit to tell them what we want but, yeah, they are able to converge. And they're able to converge to consistent labels because once we have kind of a guess put into the system we actually let the system propose as a worker. And I'll get to that a little bit later. But it proposes as a worker, so between sessions you can make sure that you say "making breakfast" instead of "cooking eggs" because that's kind of the label that was agreed upon. And people tend to not change the answer unless there's something wrong with it. If it seems like a viable answer, they'll stick with it. So in all these systems that we've looked at so far we are able to give natural language commands to the system, whether that be in the case of these examples for activity recognition, or directives on what to control or what the goal is for Legion in the control task. We can actually capture people's speech using something like Legion Scribe. But what we don't have at this point is the ability to get feedback. So we want the crowd to be able to clarify a task, clarify what they're supposed to be labeling.
In Legion, if they hit a blockade and they can't go past while driving a robot, let's say, maybe we want to let them request or ask the end user, "What do you want me to do here? Should I go find a different goal? Should I keep searching? Should I give up and save battery?" Because especially dealing with users who may under-specify the problem, the whole point is that you don't have this formalism up front. We want to actually let people just naturally request things. So the question is, can you actually get natural language feedback with a crowd? It makes sense that you could, but you, again, don't want six or seven or a hundred parallel conversations. You want one conversation. You want one clarification process, and often in real time. So to look at this we made Chorus. Now Chorus is a system that actually allows us to do exactly this: it allows us to have conversations with the crowd. And it does so by having multiple workers in the background all proposing and voting on each other's answers so that the crowd is actually self-filtering in this case. And what you see is the conversation that the end user sees. This is actually a hybrid view. So everything in white here -- I don't know if you can see that -- is something the crowd said, but everything in this conversation appears to the end user, appears in the final conversation. To make sure that workers are consistent and are actually able to be onboarded, since we don't know when people are going to connect and when they're going to leave, we also provide them with this working memory that lets them kind of pass notes to each other about what important things have happened in this conversation. So this is how we can maintain consistency across multiple sessions. But how we maintain consistency with multiple workers who might want to have different conversations, who might have different ideas of what the solution is, is that we actually have a backend that lets people vote on specific answers. And there is a point and monetary reward system that just uses a very simple incentive mechanism to make sure that people actually get bonused for contributing useful answers. And this bonusing actually allows us to do two things: one, it allows us to select just the consistent answers. And I know it's kind of hard to see that these are pink. But just the ones in blue boxes were actually accepted. And so we have the ability to vote on these things, but the ability to actually reward people dynamically for their contributions allows us to pay workers appropriately within this continuous model. So you might do a completely different amount of work. You might stay for a completely different amount of time than another worker, but instead of kicking you out and making you take a new small task that asks you to post a response or something, we let you stay attached and we pay you based on how much you do during that time. Now what we tested this on -- and this is going to appear at UIST this year -- is actually conversations with a dozen users who were asked to come in and kind of follow a script. But the underlying task was a search task. They were trying to find a piece of information. They were trying to plan a trip. They were trying to just maybe even have a general purpose conversation to help them make a decision. So one example is, "Should you get a dog in an apartment?" This is not something that you can just kind of banter about with an existing system; like, you can't talk to Siri about whether you should get a dog. It will simply search for that question.
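[Editor's note: a minimal sketch of the propose-and-vote filtering just described -- workers propose candidate replies and vote on each other's proposals, only proposals that clear an agreement threshold are forwarded to the end user, and the workers who backed an accepted reply earn bonus points. The threshold and point values are illustrative, not Chorus's actual parameters.]

class ChatMediator:
    # Collects crowd proposals and forwards only well-supported ones.

    def __init__(self, accept_threshold=3):
        self.accept_threshold = accept_threshold
        self.proposals = {}   # reply text -> set of supporting worker ids
        self.points = {}      # worker id -> reward points

    def propose(self, worker, text):
        self.proposals.setdefault(text, set()).add(worker)
        return self._maybe_accept(text)

    def vote(self, worker, text):
        if text in self.proposals:
            self.proposals[text].add(worker)
        return self._maybe_accept(text)

    def _maybe_accept(self, text):
        supporters = self.proposals.get(text, set())
        if len(supporters) >= self.accept_threshold:
            for w in supporters:  # bonus everyone who backed the accepted reply
                self.points[w] = self.points.get(w, 0) + 1
            del self.proposals[text]
            return text  # this message is shown to the end user
        return None

m = ChatMediator()
m.propose("w1", "Yes, many apartments allow small dogs.")
m.vote("w2", "Yes, many apartments allow small dogs.")
print(m.vote("w3", "Yes, many apartments allow small dogs."))  # accepted and forwarded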
And it can adapt, and you actually get a lot of very interesting serendipitous information finding where, because we have multiple workers, they'll come across more information than they would have as individuals. And a lot of times that information is actually more useful than the answer to the question that was originally asked. So as an example, in testing we saw one time the person was asking, "What are different places to visit in Houston?" And after they got a list they said, "Great. How much does each one cost?" So they wanted a set of prices to come back. What one worker ended up finding, and the group promoted and passed through, was actually a city pass apparently that gave the user all of the same activities that they were looking for plus a few extras and was cheaper than paying the individual prices. So the system didn't just do what we asked, but the workers were actually able to figure out that what they came across was a better answer and proposed a response to a slightly different question. But it's not just a conversation partner. It can actually act as an interface. And we started to look at applying this to other domains where we want continuous interaction with VizWiz. So VizWiz is a system by Jeff Bigham, et al. -- like ten other people -- that allows blind users, from their cell phone, to take a picture, ask a question about that picture and then get a spoken response from the crowd. It's actually VoiceOver that speaks the response. The crowd just provides text. This is actually an extremely popular system right now. We have had this out for, I think, two years. And it's answered over sixty thousand questions and had five or six thousand users, I think, the last time I checked. And you can see the types of uses that people have for this. So maybe it's trying to figure out what denomination of bill they have. Maybe it's reading what the stove is set to or reading the nutritional information off the back of a can. And while this system has been extremely successful, there are situations where an image alone is just not enough information. As you can imagine, it is very hard for a blind user to frame images. So they might take a picture of part of a package, ask a simple question -- "What are the cooking instructions?" in this case -- and the crowd gives a response like, "Well, the instructions are on the other side." Okay, fine. They can't tell, so they flip the box over. There are no real tactile clues as to where this answer is. So they flip it over and give an image. Now a completely different crowd of workers, a completely different set of workers, gets this question. They don't know that the user was just on the other side. So they look and they say, "Well, there are no cooking instructions in here. Flip it back over." And this can go on. And I think this was almost a 15-minute interaction. And yet, it's such a simple question. If you had another person next to you that you could interact with, you could very easily get this answer. But because the tasks are broken up, because the crowd does not have any continuity between the different groups answering, it becomes very difficult to answer even simple questions. What we did was we built Chorus:View, which essentially took the conversational interaction of Chorus and linked it to streaming video. Much like having someone on Skype, you can actually have a conversation about some item. So you look at this here. What the worker sees is a continuous video stream.
They can take some screen shots if they need to kind of look at something a little bit closer. But then they see this conversation on the side and it's just chat messages, only the user is going to interact with the crowd via voice. So they get played this little message, hear the question and then provide answers. And you can see that this one was voted in but this one was not, because one person already found the answer while the other just couldn't read it. Again, kind of going back to the serendipitous discovery, it turns out that a group of workers, even with low quality video streamed over the web, can do a very good job of retrieving this information, because it turns out that somebody takes the screen shot at the right time or somebody sees a piece of information or can interpret a piece of information or recognizes the package in some cases. So what we were getting are responses that we, as the designers, couldn't have answered from the video but some crowd worker was able to. And then, you can see that this type of interaction allows the user to take a picture -- well, not take a picture but start the stream -- aim somewhere on the package and then, if it's wrong, they get an adjustment: "Okay, well, rotate to the left." Or maybe it's, "Rotate until we tell you to stop." Maybe it's, "Rotate 90 degrees." Once they do that, we're using the same crowd and the same group of individuals who know what just happened, know what pieces of the package they checked and can kind of get more perspective, because over time you can understand what you're looking at even if you can't make out the specific piece of information, and you get responses a lot faster. And this is going to be at ASSETS this year. So what's next? Well, we want to finish kind of formalizing this model. You can think of this in terms of being a dynamic system or modeling this using control theory. But we want to know what can you get out of this crowd agent model if you formalize it? Maybe people are too uncertain, but you can still give some error bounds if you assume something about the workers themselves. And hopefully this isn't a very flat assumption like a lot of incentive models make, where people will strictly optimize for some payoff. And then, we want to start really deploying these systems in the wild. We have been able to use real users to test these things but we haven't just seen what happens if you give people an intelligent interface. And I don't think anybody really knows yet. Right? We've always had these very limited-scope, very limited-release systems. Siri at least had a large release but only handled something like 15 functions, and they're reasonably straightforward. And then, we need to address some of the security and privacy concerns that this brings up, because if you look at all of these systems, you know, we're streaming video to the crowd. We're giving personal information maybe about location so we can find a restaurant via chat. Again, this is video. We're streaming audio to the crowd. We have all this visual data even in the form of images which might contain information that you didn't want to share, either personally-identifying information or something like catching a credit card in the background of a picture or catching something about maybe a mailing address because there's a letter left on the desk here. So it's very easy to capture information that we maybe didn't want to make public. So that's what I've been working on this summer. And I've been working with Jaime and Ece on this.
And we started by really just wanting to know what the threats are. If you think about attacking a system, you would need one malicious worker to be able to find out information from a task. So if somebody accidentally captures their credit card, it only takes one person to compromise that. But if you wanted to actually give the wrong answer -- if somebody takes a picture of a can of peas and you want to make sure that that person really thinks it's a can of corn for whatever reason, which might actually matter for allergy reasons and all this other stuff -- what you really need is a majority. So there are two different manipulation types, but where can we find workers for this? And I should point out that this is to find a piece of information, but finding the task that contains that piece of information is much more difficult. You probably have kind of a needle-in-a-haystack problem. So you want one worker per task maybe, but you need a lot of workers to find those tasks. So in either case you have a bigger benefit from having a group. And one of the easiest places to recruit a group of users is the crowd itself. So could we actually attack crowd-powered systems with the crowd? So the first thing we look at is how can we extract information? And we want to go and we want to get that credit card. I've stopped using my own and I want to use somebody else's. And we've set up this task where, when the worker gets to the task, they are asked to click this big button. That immediately forwards them to a task posted by a different requester, and that's just something very simple like, "Tell me what's in this image." Maybe it doesn't even know. Maybe the requester of this inner task doesn't even know that this target here is being attacked. And we want to know, can we get people to return the same information to the attacking task? So the first example we looked at was this hopefully very easily identified fake credit card. It's not real. It's drawn. The number is 12345; nobody is going to really confuse this with someone's actual credit card. And we want to know, are workers more or less willing to answer this question compared to something that looks like a real card? Now this actually is not a real card, but they didn't know that. And it looks pretty real. This person is holding it up. It's even got the hologram and everything. So what do you think? Like, can we get workers to do this malicious task? Or are they somehow perfectly ethical? So getting at the ethics is kind of complicated. We have a lot of information here that we copied out. Sixty-two percent of workers will successfully give us all the information we wanted. But as soon as we make this a potentially harmful looking task, it drops to 32 percent. So that's kind of interesting. So workers are paying enough attention to this task and are making some assumptions, obviously, about maybe we aren't linked. Maybe we don't own both tasks, right, because we certainly didn't post as the same requester. They have no way to know that we're the same person. And obviously they think that there's harm, or at least this appears to be the case. Now what if instead of taking information out, we want to put information in? So what if I just wanted to give the person the wrong credit card number or maybe translate the word incorrectly? I believe reCAPTCHA had a problem with this, with people just typing in other words. So now the attack task is even simpler. It says, "Don't come back and tell me what they said.
Just tell them that this word --" Can anybody tell what this word says? Anybody want to take a guess? Okay, so it's a little ambiguous. It could easily be gun or fun or sun or lung or fir. Somebody -- You know? Sun. So, "Tell the person requesting the task that this is sun." And I kind of, you know, can see that. Sure. On the other hand, our second condition was, "Look at this word." And I'm sure more people can make it out. "Also tell them that's sun." Right? It's basically the same. So 75 percent of people were willing to say that this was the word sun. It seems kind of viable until you consider that, if it wasn't suggested, only 12 percent of people actually said that this was the word sun. Most people actually thought it was gun. Okay, so they're suggestible. Maybe we [inaudible] a little bit. Maybe we convinced them that this was the right thing to do. Maybe they were just willing to listen to us because it didn't seem harmful; it seems like that's a plausible answer. Not as many of those conditions hold over here. They can't really think that this is a viable answer. And sure enough, we see it dropped to just under 28 percent of people who were willing to give us the answer we asked for when it's clearly incorrect. So this is on some levels promising. Right? It means that there are workers out there who will see this as being a kind of ethical decision to make between these two different tasks and will respond at different rates. Or maybe they don't want to risk it; they want to do what they're told by the inner task. Let me stay there for a second. So that was interesting, but we still have about 30 percent of the crowd in each case that's willing to do whatever we just asked them to do. So what can we do about that 30 percent? What can we do to safeguard against that part of the population? And the typical way to post a task is you start -- you have some potentially sensitive information. You send it to random people on the Internet and you have the potential for bad things to happen. What we would like to do instead, then, is pass this sensitive information -- let me try to handle this -- to a filter. And then, using that filter, take out all the pieces that we don't want the crowd to see and come up with a more anonymized task, which is then passed to people on the Internet, and we get our task solved without having to sacrifice privacy. Okay, so as an example of that, let's take this real simple little piece of text. Let's pretend this is hand-written out or whatever: "Hi, Bob. Here's my username and password. Alice." Now anyone who has been anywhere near a cryptographer, a cryptography class, anything, knows that Alice talking to Bob is a recipe for disaster. It never ends well. And there's always Eve, who is somewhere off to the side listening in, waiting to steal that information. The problem here is that we can't just encrypt this because Eve is mixed in with the rest of the crowd and we don't know who that eavesdropper is and who is just trying to help us complete our task. So what we'd like to do is anonymize this information. We take out all the sensitive pieces and now Eve has nothing to do. The problem is this also potentially limits the crowd's ability to solve the task. But if I get a response that says, "Well, I've parsed your piece of information. You have two facts in here," and it says username and password, both are stars -- well, that doesn't really help me now as the end user. My information was processed correctly.
Kind of my anonymized information was parsed and now I got something back that wasn't as useful to me because I don't know which pieces were used here. So we go a little bit beyond just filtering, and what we want to do is actually replace the sensitive content with semantic tags. And this will hopefully allow us to track each piece of information with the system on the backend. And then, when workers actually refer to a piece of information it's like, "Okay, well, female name, username is such and such." Now the system can repopulate it before the end user sees it again, and they can kind of see, with some degree of security, exactly what they expected to when they put something into the system. It comes back as, "Well, Alice's username is AStern." But the crowd doesn't get to see that. And we want to do that by actually dividing this up into small pieces, maybe only giving each crowd worker one word at a time, and having them slowly filter and complete the task -- filter and substitute these semantic tags as they go. So if I see a name, I can label that as such. And if I see -- maybe I don't know this is a username until we back out a little bit and get increasingly larger context for a piece of information, and we filter it as soon as we can. All right. But even if we filter out all the information -- this is kind of going a little bit beyond the internship here and beyond what we've previously discussed. So even if we can filter out this information and we have like an anonymous crowd who can give us responses that are good responses, that don't reveal any personal information, we still have a scaling problem with crowdsourcing. Kind of in the extreme, imagine if everyone on Earth had a crowd-powered system running from their phone. Where do you find the people to power every person's system? Even with timesharing, it just doesn't work out, right? Now that's pretty far in the future, but even for cost reasons we probably want to minimize the number of people we hire for these tasks. So can we actually increase automation? Can we use this same model that has allowed us to stop thinking about the crowd as a collection of noisy workers and just look at it as a single entity that gives us a reliable answer, to actually provide training data, or maybe even go beyond providing training data and provide understanding and structure to the final system? Now within the crowd agent model itself, what we can do is, when we look at this general process, right, we have a bunch of people -- a bunch of entities -- completing tasks and they're doing so in a continuous system. But the model itself doesn't actually know whether each of these contributors is a human or a machine. So if we can start replacing some of these noisy contributors that we're already assuming with noisy automated processes, then even if they get the answer wrong, our expectation is that the system as a whole -- by using agreement with human workers who can kind of catch these errors and maybe provide good input themselves -- doesn't have to propagate that error through the merging process. And what this means for an interactive system is that we're able to use the system with maybe flawed or only partially capable AI in the background, so it can complete a piece of the task but not the entire thing, and all the noise generated by that, all the wrong answers, are removed from the system. They never corrupt the end user's experience.
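[Editor's note: a minimal sketch of the point just made -- inside the crowd agent's merge step, an automated component is treated as just another noisy contributor, so a wrong machine guess is simply outvoted by agreeing humans instead of reaching the end user. Everything here, including the minimum-support threshold, is illustrative.]

from collections import Counter

def merged_answer(contributions, min_support=2):
    # contributions: list of (source, answer) pairs; a source may be a human
    # worker or an automated recognizer -- the merge does not distinguish them.
    counts = Counter(answer for _, answer in contributions)
    answer, support = counts.most_common(1)[0]
    return answer if support >= min_support else None

# The recognizer's wrong guess is outvoted by two agreeing human workers.
print(merged_answer([("recognizer", "leaving room"),
                     ("worker_1", "paying at register"),
                     ("worker_2", "paying at register")]))  # "paying at register"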
So even if we have to pay a little bit more every once in a while for more people, we can do that hopefully as little as possible, and we can do that in a system that never looks like it makes a mistake, or at least has a reasonable error rate, let's say whatever we expect from humans. So just to briefly recap here: I started by talking about the background of crowdsourcing and how interactive systems can go beyond this kind of divide-and-batch process model that has been almost completely pervasive in crowdsourcing so far. And then, we talked about how we can really use this crowd agents model to go beyond just simple interactions with the crowd, become more interactive with our systems, actually power intelligent systems that can speak back, can have conversations and can be consistent across long ranges of time, and how that starts to point to kind of truly intelligent systems that we can put into the wild. And right before I finish up here, I promised I'd get back to this quote. So this is kind of interesting because I think this points to kind of the interesting future of crowdsourcing, or something that we as a community should look at going forward. This is actually wrong. Empirically, from years and years of practice, managers don't think this, at least not anymore. If you oppress your employees, if you take each one and say, "I'm going to micromanage you so that you never waste a second," and you put these systems of control with a narrow span of control on workers, they don't actually do more; they're not actually more productive. And McGregor didn't actually believe that this was the answer. He believed this was part of a spectrum, and kind of the other side of this was what he called Theory Y. And that says workers are self-motivated. Workers want to be doing a good job; they have some pride in their work. And what managers are supposed to do is not put them into these rigid structures of control but actually just enable them. And I think for people working here that's certainly closer to what their day to day experience is than being told every minute of the day what to do and maybe having to file TPS reports. But I think just now crowdsourcing is starting to get away from this thought and come into this belief that workers can be leveraged in different ways, can be leveraged maybe as more self-motivated employees if you just compensate them fairly. It will be interesting to see what the crowd community can learn not just from management science but also from cognitive science, psychology, operations research and all these other fields that I think have set a lot of the precedent in the same type of work but in a different domain, where now crowdsourcing can come and extend this to more complex tasks, more dynamic tasks and really change the way that we do business. Thanks. [Applause] I'm happy to answer questions. Yeah? >>: [Inaudible] what if you had a case where the crowd collectively was initiating an attack? Does the system have an ability to defend itself? >> Walter Lasecki: So for example if you had [inaudible] or Anonymous or somebody kind of come after it. Well, so if you're looking at the method for creating a private task, like the filtering that we were looking at, that's kind of going through a self-selected crowd. So we're putting this on Mechanical Turk. If Anonymous wants to then try to corrupt Mechanical Turk, it passes that back [inaudible].... >>: [Inaudible] simply by doing volume. So your crowd source is only so large naturally.
So like you say, [inaudible] or say a group at Reddit decided they were going to try and attack this system, they could in theory overrun it by simply being the largest group of users, by being 60 percent of the users. They could overrun any particular path. Does it have a way to identify feedback from users to say, "This is wrong"? And then, you get enough of these things and you say, "Our source has been compromised right now," and then it flags the system or something? >> Walter Lasecki: So it's possible -- because we saw this difference where some workers were willing to kind of back out of a task, and they were actually passing up payment to not complete the task, in either the extraction or the manipulation example -- it might be possible to ask workers to essentially flag information like that. So maybe we get a very small percentage of the group coming back and saying, "I see what other people are doing. I see that this is a problem." But then you have to make it transparent to other workers what people are doing. In general, if you have to take over 60 percent or even just 51 percent of the crowd, there's a certain cost implied. So you can always just kind of [inaudible] up how big of a workforce. You can also use known good workforces. So maybe if I have a small internal employee pool and then I'm using the crowd to kind of do the really heavy lifting, the filtering task can be done internally or by more trusted workers, and we don't actually give access to those kinds of groups or to any service that could be corrupted by those types of groups. That's one way. >>: You showed that people are less willing to do a task if they think it's a bad thing, like the credit card numbers. Are they more willing to do a task if they think it's a good thing, like helping a disabled person? >> Walter Lasecki: Right. So I did not add that slide. But we actually have initial data, kind of from survey results, trying to look at if you give workers a set of tasks, it's all the same kind of task but there are just little bits of flavor added. So, "This task will help a blind user" or, "This task will help a large corporation," and looking at the fact that there is seemingly a preference difference. People will go for the assistive technology application. And certainly in our experience we've seen workers go out of their way to come back and ask us, "You know, if you have another task, let me know." So yeah. There is some bias there and we're now kind of working on experiments to show that. >>: Coming back to your last slide about how [inaudible] workers are [inaudible] around the 1960s. I think the work that people do in jobs has also changed [inaudible]. Like people are more engaged. You kind of change from [inaudible] factory to maybe a more intellectual job or something that keeps your mind occupied or keeps you engaged. I don't know if there's a progression like that in crowdsourcing too, where the tasks started being these very small things for a single person who's not actually very intellectually challenged by what they are doing. But hopefully we will give them tasks that will be more interesting or engaging than being these small pieces. So I'm kind of wondering if that'll have some kind of an influence too. >> Walter Lasecki: So in my experience I think it will, because if you look at some of the feedback that we've gotten from workers in these continuous tasks where they can kind of stay for longer periods of time. We're not constantly interrupting them.
We're letting them kind of invest what they think is necessary in the task. So if you want to go and really do a lot of research to bring an answer back to Chorus, that's great. And we'll try to reward you for that. But it's not necessary. Like, you could just give the basic answers. You can opt in when you want to answer and when you don't want to answer, that kind of thing. We give a lot more freedom. We give a lot more kind of ownership over the data, I guess. You're really not seeing the individual worker contribution, but the workers themselves see that they get more points and things like that. So there are, I think, a lot of different levels of engagement and a lot of different levels of commitment to a specific task, because you can stay for a longer span. So that hopefully is even supported by the continuous model. But, I agree. Yeah? >>: When you're using the crowd agent to train the hidden Markov model for the video, did you find that to be a reliable method of training? >> Walter Lasecki: Well, so what we were doing there is actually having the crowd come up with label sets that were exactly the same as an expert would have. So they're basically -- it's a correlated time range with a label. So because -- and I kind of mentioned this in my answer to John -- because we can get consistency across what we were calling the activities, by taking a guess with the system and trying to like pull people towards consistent answers, then we could kind of re-label "making breakfast" every single time we saw a different type or different course of action that went into making breakfast. We were still able to come up with the same label. So the system could get trained. It didn't have like 400 different labels for a single activity. So then, at that point, it just kind of converges to what an expert would do right now. But we can do it online, which is [inaudible]. All right. >> Ece Kamar: Thank you again. [Applause]