>>: Each year Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors and leading academics, and makes videos of these lectures freely available. [music]
>> Kori Inkpen Quinn: Okay. I think we can get started. We're very excited to have Walter
Lasecki interviewing with us today from the University of Rochester. Walter is a student of
Jeffrey Bigham and James Allen. He's no stranger to many of us here. He has done two
internships and was an MSR Fellowship winner. Walter also won a best paper award at WISS 2014, and his work, which you will hear more about in a minute, is related to continuous real-time crowdsourcing.
>> Walter Lasecki: Thanks for the introduction. Thanks for having me out. Today I'm going to
be talking about how we can use human intelligence to augment automated systems that go beyond the boundaries of AI and solve problems that neither people nor machines can solve alone. The resulting intelligent systems can be used to solve important problems today, such as providing better access to technology for people with disabilities. But they can also scaffold future intelligent systems as we think about how to build and train them. Why do
we want intelligent systems in the first place? We can do things like have more contextualized
and natural interactions with computers. We can offload some of the low level tedious tasks
that we have to do and focus more on our high-level goals. We can even convert one form of
media to another such as converting speech to text in real-time with closed captioning. While
these are really interesting and very general impacts that these systems can have, I'm particularly interested in how we can use these technologies to actually help improve accessibility, whether that's accessibility for individuals with low literacy or low technical literacy who might need a more natural way to interact with information than the typical interfaces that we use, or helping people with disabilities, such as people with motor or visual impairments who need to control an interface that isn't necessarily as easy for them as it would be
for us, or even providing captions for deaf and hard of hearing individuals. The point of all of
this is that this requires solving AI-hard problems. These are problems like understanding
intention in natural language or understanding the content of visual scenes. We're just not
there with automatic systems just yet. We need human level intelligence. But fortunately, in
recent years platforms like Mechanical Turk have come out and allowed us to actually access
human intelligence and insight very, very quickly with an API call. Now you can start to think about algorithms that integrate this human intelligence as part of their function. We call this human computation. I said you can do this quickly, but originally it took hours or days to get responses, and only in more recent work by folks at Rochester and MIT has it been shown that you can get this down to just a couple of seconds to get people to show up at a task. In fact, last year I released some of the first publicly available tools that allow anybody to go in and recruit small subsets of the crowd in advance, so we can have people on demand exactly when we need them.
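To make that concrete, the kind of pre-recruited retainer pool those tools manage can be sketched in a few lines of Python; the class and method names below are invented for illustration and are not the actual released tools.

import queue

class RetainerPool:
    """Keep a small pool of pre-recruited workers waiting so tasks can start in seconds."""

    def __init__(self, target_size):
        self.target_size = target_size
        self.waiting = queue.Queue()   # workers currently on retainer

    def worker_arrived(self, worker_id):
        # Called when a recruited worker accepts the retainer task and starts waiting.
        self.waiting.put(worker_id)

    def needs_recruiting(self):
        # Post more recruiting tasks whenever the pool dips below its target size.
        return self.waiting.qsize() < self.target_size

    def dispatch(self, task, timeout=2.0):
        # Hand the real task to a waiting worker; returns None if nobody is on call.
        try:
            return (self.waiting.get(timeout=timeout), task)
        except queue.Empty:
            return None

pool = RetainerPool(target_size=5)
pool.worker_arrived("w1")
print(pool.dispatch({"kind": "answer_question", "payload": "photo.jpg"}))

We can get people to a task quickly, but it's very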
different to manage people in a computational process compared to hardware and the more traditional forms of computation. That's for a number of reasons. First, a lot of people
really use these platforms because of the flexibility that they allow in the way that people work.
You can arrive at any time as a worker. You can pick tasks that are really the ones that you're
interested in and nobody's saying you have to stay around. You can leave whenever you want,
which means we may not know if a person is going to complete a task or come back for the
next task that we would think of as being a part of the series. Even once we get a response, it's
not necessarily true that we know what the skills of that individual are, how much effort they
are putting in, or whether they are even trying to game the system. And sometimes people just make mistakes, or have a setup that leads them to see the problem differently than we expected. This means that we don't know if we're going to get our tasks
completed. We don't know what order they are going to be completed in and we really don't
know what the quality of the answer is even if we get it. So the field is really focused on how
we can solve these problems by taking a task, breaking it down into small, context-free units of work that only take a few seconds or a few minutes to actually complete, called micro-tasks, and then using that divisibility and that context-free nature to let workers on these
platforms take the task in whatever order they feel is best. This works pretty well. But then we
realize once we get these answers back we don't have a good way to confirm that they are
actually correct. In most cases it's actually just as hard to confirm an answer is correct as it is to
solve a problem in the first place. So instead we are going to use the fact that these are small
units of work that are designed to be easily comparable. What that means is that we can collect a set of responses to each of these questions and use a simple aggregation scheme, such as voting, to get a more reliable answer that we can really have faith in.
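As a rough sketch of the kind of voting being described, the aggregation step can be written in a few lines of Python; the labels and agreement threshold here are purely illustrative.

from collections import Counter

def aggregate_by_vote(responses, min_agreement=0.5):
    """Pick the most common answer to a micro-task, if enough workers agree on it.

    responses: answers from different workers to the same question.
    Returns (answer, confidence), with answer set to None when agreement is too low.
    """
    counts = Counter(responses)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(responses)
    return (answer if confidence >= min_agreement else None, confidence)

# Five workers label the same image.
print(aggregate_by_vote(["cat", "cat", "dog", "cat", "cat"]))   # ('cat', 0.8)

This basic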
idea has been used in a wide range of systems that we have seen so far. There are things like
coming up with arbitrary image labels for images on the web, answers to visual questions for
blind users, even finding just the right moment to take a picture with a smart camera. It's also
been used for things that are nonvisual, for example NLP and linguistics tasks, finding
linguistic annotations for text data sets, editing documents or even translating the documents
to another language. This is really just a small sampling of the space here. There is a lot of
work in the HCI and AI communities on these types of systems and really we have seen in the
past couple of years an entire conference sponsored by AAAI, HCOMP, and even a human computation journal arise to support this community. But while micro-tasks are extremely powerful, they fundamentally limit the types of tasks that we can think about for crowdsourcing and using human computation as part of the process, because not everything can be broken down into these little pieces of work. I'm going to make the claim that we can actually create real-time systems that support multi-turn interactions. Even though we have a very dynamic crowd,
we can maintain context and maintain interaction with these systems. I'm a systems builder, so
I have built a number of large web-based systems for coordinating workers in real time around
a single task over the last four years. And for each of the systems, we kind of look at a way that
we can solve a very generalizable problem and a whole space of problems while also solving a
specific instance of a challenge that we had not known we could use crowdsourcing to
solve before. And then for each one of the systems I built I've looked at what are some of the
other problems in that space, the actual application space, what are some generalizable
properties about the crowd we can use to inform future systems, and even, how do we get
around some of the hurdles that we see in deploying these systems in real domains? Since I
can't talk about all of this in one talk, I'm really going to focus on explaining how maintaining
context is critical to a wide range of tasks. Then I'm going to introduce a new type of task,
actually not a micro-task but a continuous task that keeps people engaged for longer periods of
time and goes beyond a lot of the assumptions that we made for micro-tasks. And then I'm
going to talk about how this allows us to start thinking about going beyond the traditional
impetus for crowdsourcing, which is that we have a task that people are good at but computers
really aren't, and even looking at tasks that individual people are not necessarily good at. I'll
talk about some of where I want to go with this work in the future. Let's start with maintaining
context, and I'm going to use the example of VizWiz, which is a system that was originally introduced in 2010 that allows blind users to take a picture, speak a question and get an answer within about a minute, which at the time was state-of-the-art. I helped deploy this system in 2011, and the resulting data set of almost 20,000 questions asked by thousands of blind users is really fascinating for a number of reasons. But in this data I found a set of cases where
it really shows that micro-tasks are not the end-all be-all of human computation. And I'm just
going to give this one example here where a user asks, what are the cooking instructions on this
TV dinner? It's actually a pretty popular class of questions, and understandably, the
information needed to answer this question is very hard to frame for a blind user. There's no
tactile feedback on the actual package of which they are taking a picture, so they don't know
that they didn't capture the cooking instructions. When the system goes out and recruits a worker or two to answer this question, they say, I can't see the instructions. But then they
usually will try to give a little bit of feedback on how this might be corrected. They will say why
don't you flip the box over? So the blind user flips the box over, takes a picture again and asks,
how about now? But the system doesn't necessarily recruit the same workers and it's not sure
that that person is always going to be available. Instead, it goes back to the crowd, it posts
another micro-task, gets a different worker who then doesn't know the original question. So
they say something very general like it's a box of food. So the user goes in, fixes this, realizes
what the problem was. They re-ask the question, take a picture again. The system goes and gets another worker, and they say, well, I can't see the answer, so why don't you flip the box over? These kinds of cyclic interactions can actually take a very, very long time. In fact, even though we get answers to every individual question back in about a minute, it turns out that this typically takes about 10 to 15 minutes before a user can actually get an answer to their question, or they just abandon the task altogether. The core problem here is that the end-user is trying
to have a conversation with the system that maintains context when the system doesn't
actually support that type of interaction. I'm going to focus in on this problem and look at how
we can solve the challenge of actually holding a conversation that takes multiple turns and maybe even spans multiple sessions. We go away and we come back tomorrow and want
to continue that conversation from where we were. To explore this I created a system called
Chorus, which is a virtual personal assistant powered by many crowd workers
behind the scenes. To the end-user this actually just appears as a chat conversation with a
single individual. They have a pretty standard instant messenger window and they are able to
interact with the system. Behind the scenes, workers also participate in a process of curating a sort of working memory. These are important facts. It's kind of a predictive task where
workers try to speculate on what would be important to future workers who are completing
this task. That might include capturing things like allergy information or location information
about things that they shared in the past. To actually hold a conversation it's a little bit more
complicated than just posting a chat message. Instead, we actually have a collective process
where workers are all proposing and voting on one another's answers. We essentially have a rolling voting scheme. To incentivize people to give us the right answers, the best answers that they can, we have a mechanism that asks people to propose answers, and if they propose an answer that's correct we'll give them 3000 points. They don't get any points if other people don't agree that they have given a reasonable answer for the current task. Since it's a little bit easier to vote for one of these answers than it is to go out and
find information and bring it back, if a worker votes on an answer then they will get 1000
points, again if that is marked as correct. To make sure that people don't just randomly guess
at what the answer is if they don't know, we also add in this kind of no-op bonus. It gives
workers a little bit of a reward for not doing anything because it's pretty easy to figure out
when people are abusing it. One of the nice things here about this incentive mechanism is that
behind the scenes we are actually able to tune a single parameter to figure out the spacing
between these rewards. And what that will effectively let us do is dial in the verbosity of the
crowd, so how many different responses do we want? Do we want a large set of maybe
creative responses, or do we want a pretty narrow set of really reliable responses? And while there are some issues, if you look at the actual game-theoretic properties of this mechanism you will find there are multiple equilibria, empirically this does work to encourage people to propose more or fewer answers.
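To make the shape of this incentive mechanism concrete, here is a minimal sketch in Python; the exact point value of the no-op bonus and the acceptance logic are illustrative rather than the real Chorus implementation.

def score_contribution(kind, accepted, propose_reward=3000, vote_reward=1000, noop_bonus=100):
    """Assign points for one worker contribution in a Chorus-style reward scheme.

    kind: 'propose', 'vote' or 'noop'. A proposal or vote only pays off if the
    answer it backs is eventually accepted by the rest of the crowd. The spacing
    between propose_reward and vote_reward is the single knob that tunes the
    verbosity of the crowd: how eager workers are to propose versus just vote.
    """
    if kind == "propose":
        return propose_reward if accepted else 0
    if kind == "vote":
        return vote_reward if accepted else 0
    if kind == "noop":
        return noop_bonus   # small reward for honestly sitting out instead of guessing
    return 0

print(score_contribution("propose", accepted=True))   # 3000
print(score_contribution("vote", accepted=True))      # 1000

What does this all mean in the end? It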
basically means that we can filter down the set of responses to just the ones highlighted in blue
here. We really remove things like this line where somebody comes in right in the middle and
says well how is everybody doing? That clearly is not what we wanted right in the middle of the
conversation, so other workers catch on to this and they filter out that response. The end-user only sees a consistent conversation that is accurate and is sensitive to the right points in the conversation, to what we actually said in the
past and how that fits together. We brought a dozen users in to our lab to actually study how
the system works. We gave them a high-level objective of what they wanted to accomplish. We didn't give them a specific script, as that would kind of defeat the point. We ran this as a within-subjects design study. The results were really interesting, and I'm just going to give one example here to give a flavor of how some of this turned out. The user in this case comes in and asks about some activities that they want to do in Houston. They basically ask what the price of doing this set of activities is, and after a little bit of clarification the crowd is able to come back and tell them that it will cost $150. Interestingly, while that's the answer to the question the user actually asked, a moment later the crowd comes back and says, actually, there is a
better answer. There is a city pass available from the local tourism office and that allows you to
do everything and actually a little bit more than what you are expecting and it's cheaper. So
this is an answer that the end-user did not actually expect to need, but it was
kind of serendipitously found by the system by this collective search process behind the scenes
that can discover information even if the end-user didn't ask about it explicitly. And because of
this serendipitous information finding, we actually find a majority of our participants preferred
Chorus to a keyword search like Bing. More quantitatively, we can also show that the system is
able to stay on topic, is able to give reliable answers that are very accurate and answer what
people are actually asking about. And interestingly, the crowd is also able to recall 80 percent of the facts from a prior session. Even though no workers were present when the user originally sent a piece of information, they were able to recall it using this working memory window 80 percent of the time, so we didn't have to re-ask. We didn't have to rehash the same ground. It is possible to support this type of multi-turn interaction even though we have a very
dynamic crowd behind the scenes that is constantly changing. And if we look at the roles that
we actually have people completing in the background, given a question, some people are
going to propose a new answer and others are going to help filter those answers down so
that we only forward the correct answers to the end-user. It's easy to see also how we can use
this same structure to include things like dialogue systems. It doesn't matter that this isn't a
person now. If it produces the right answer, then humans will go in and filter this and make
sure that we are only passing good answers to the end-user. But if it produces an incorrect answer, in this case the wrong city, one that sounds like the right city and has the same name but is a city instead of a region, then workers can catch this, find out it's not the right information, propose a new answer and then filter that so that we know the human answer is correct. Not only do we prevent the system from ever giving the end-user an
incorrect response and we prevent the system from ever corrupting the end-user's experience,
we also get the advantage of actually having this as training data. So where we would have
normally have derailed the conversation and gone down a conversational path that doesn't really fit what the end-user wanted to accomplish, we can now stay on track and see what would have happened even after the system makes a mistake. And we still know the system made a mistake because its answer was not voted on and forwarded. This gives us kind of a new way to
think about how we deploy and then train AI systems in the wild. Now I want to talk about a
new kind of task, a continuous task that goes beyond a lot of the assumptions of micro-tasks
and lets us solve a whole new space of problems. This is an example of something that I think
we can't solve using micro-tasks alone. I'm going to use the example of driving a robot. This is
very important for home-assistance robotics where you might have somebody with a mobility
impairment who needs a little bit of help actually accessing things in their home. But how
would you control this with micro-tasks? It's a naturally continuous space. We have to be able
to sense the environment. We have to be able to listen to the user and decide what we actually
want to do about that. We want to provide continuous control to the robot which does not
expect discrete input and we have to actually be able to respond to events in the environment
as we encounter them. As a proxy for this home-assistance robotics task I'm actually going to use
this little off-the-shelf Rovio robot where we have a little bit of an obstacle in the way and we
have to drive around the obstacle to get to a target. Instead of breaking this down into little
pieces, we're actually going to give workers control of an actual streaming video interface that
is from the front mounted camera on the robot and we're going to let them use the default
controls that the system comes with which are basically just arrow keys and a few other simple
controls. If we do this and we give control to a single person, what we find is something like
this, where the majority of the time we actually fail to complete the task, because people will join, give it a good first shot at solving the problem, and then, as in this case, fail to complete it. They crash into a wall and then they abandon the task. This is a top-down view
of our set up in our lab with a trace of where this particular instance of the navigation task
went. This is because there is going to be some noise in the system. Not everybody is going
to want to stay around and complete this entire task and if it's not easy to do they might just
have a higher rate of abandonment. If we use the fact that we have lots of crowd workers available and just pick a new one from the crowd, letting somebody else take control whenever a worker abandons like this, we get something like this. We find out there are trolls on the internet, basically. Somebody here navigates the robot almost to the target, but then it's back and forth and back and forth and back and forth, and they abandon the task somewhere there a few minutes later, and finally a different worker connects and finishes the task. This of course
was something that raised our completion rate in this case because eventually somebody will
finish most of the time, but if we could drive this robot off a cliff it would be pretty problematic,
right? It works here because you can't destroy the task. What we really want to do to make
sure that we don't have this kind of malicious or otherwise noisy input is to aggregate people's responses. We used simple voting in the case where we broke something down into micro-tasks, and we want to use something similar here, but a lot of the standard approaches for dealing with continuous input don't necessarily work. I use the example
that if the robot drove up to a tree it can go left. It can go right. There is a perfectly good set of
options, but we don't want an average. Instead, we're going to take a vote, but that requires
actually discretizing time, so this is going to be a much slower approach where we need to wait
for one second of input, look at the most popular answer and then use that to control the
robot. The problem is still that we are not taking a consistent stream of actions that any one worker would have selected, so we have a lot of conflict early on here, where when we are
trying to decide how to get around this barrier, do we want to go forward first? Do we want to
go to the left first? People ended up crashing a lot in this case. That doesn't quite work. To
solve this problem, we're actually going to borrow a page from representative democracy and
we're going to pick one leader who is in control at any one time but then not make that
necessarily the same person over multiple periods in time and I'll show you how this works
briefly. If we wanted to really solve this problem, we would have to solve this POMDP where
we really want to get agreement between different workers' policies, but of course, we can't see what people would do at every point in time. And there are different representations in people's heads, so we don't even know exactly how the states line up with the real world. This isn't very tractable. Instead we're going to try to
approximate this by looking at people's input over time, comparing it to the rest of the group, and then picking whoever is most representative of the group to be in control. Given that we have some input, and some funny properties of that input, for instance, we can't assume that two up commands are the same just because they look the same, since going forward a moment later might be a very different action than going forward at the first moment, we're going to bin this over a one-second time period, look at each worker's input for that time period and use a similarity function, cosine similarity here, to compare it to the rest of the crowd, so we get some score as to how similar they are to the rest of the individuals who are contributing. Then we'll use that score to update a weight, and we do this for every time point and for each worker in the crowd. This means that at any given time point we can just pick the highest-weight worker and give them direct control as the kind of representative leader for that period. As we keep updating these weights, we'll get essentially a new person in control at any given one-second span, but we won't be interleaving people's input on a finer-grained scale.
Legion was able to not just control a robot, but actually an arbitrary user interface, by letting you select a portion of your desktop, give a natural language command for what you want completed, and then it would go about completing that task with the crowd.
We have used this to control anything from office software to assisted keyboards for people
with motor impairments to even letting multiple people play a single player videogame
collectively. The idea of being able to keep people involved in a task for a longer period of time also points to a new way to maintain context. If we had one person here
for the entire task, then they could remember what was happening, the same thing for the
conversation. To set up a little bit of a test for that I actually used a videogame set up. We
created a custom map here and basically the idea was we had people using Legion to control a
videogame character. They were walking in this cyclic map and it took about 30 or 40 seconds
to get from one decision point to the next, and the decision they made was essentially to push the white button or the black button depending on some prior cue that they saw. We actually ran this for a full hour, and it turns out we can complete this task even though people were not present for more than a couple of minutes at a time. We could complete this task consistently and reliably the entire time. Here you see a graph of people's
input. You can see workers on the y-axis there and each red marking is basically a point at
which the worker contributed some input to the system. And we see that maybe a worker stays for a few minutes and then they leave, and somebody else joins for a few minutes and then they leave. But the way we were able to control this for a longer period of time is that if you look at when people actually join a task versus when they start to contribute, we see that there is basically this synchronization period where they are watching what other people are doing to learn how to complete the task, and then they use that to continue on in the same way. This is kind of reminiscent of organizational memory from the behavioral literature, where this is essentially the process that organizations and societies use to pass traditions down and other
types of behaviors. This really gives another interesting way to capture this idea of context. I
talked so far about two very different ways to accomplish this goal of maintaining context over time. I want to zoom out for a little bit and talk about the more general framework
that this fits into. Specifically, if we wanted to create these systems, what is the architecture?
What does it look like? We are always going to need a way to divide these tasks either into
small pieces or into actual roles that people synchronously coordinate and complete. We're
also going to need a way to collect input from workers, an interface that lets us get input from
contributors and contributors could be human workers or they could be automated systems
that know how to do all or a piece of this task. In either case we will need reward schemes to
make sure that systems are incentivized correctly or workers are paid fairly for their efforts in
the system. Then we're going to need to think about the communication channels between
people and how we actually think about maintaining context without allowing collusion,
without allowing people to really coordinate to the detriment of the incentive mechanism that
we are using. And then we are always going to need to actually aggregate these responses.
Whether we're thinking about controlling an interface or about holding a conversation, we
don't want parallel threads in existence. We really want a single control stream. We really
want a single conversation. This idea of having a single input and a single output is the same
idea as having a collective that acts as a single individual, which we'll call a crowd agent. This
idea kind of harks back to the agent architectures that we see in the AI literature in prior work.
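One way to read that crowd agent architecture as code is the skeleton below; the class and method names are invented for illustration, and real systems fill in very different implementations of each piece.

from abc import ABC, abstractmethod

class CrowdAgent(ABC):
    """Skeleton of the crowd agent pattern: many contributors in, one behavior out.

    Contributors can be human workers or automated systems. The agent divides
    the task, collects and rewards input, and aggregates everything into a
    single output stream, so the collective acts like a single individual.
    """

    @abstractmethod
    def divide(self, task):
        """Split the task into micro-tasks or synchronous roles."""

    @abstractmethod
    def collect(self, contributor, piece):
        """Record one contributor's input for one piece of the task."""

    @abstractmethod
    def reward(self, contributor, outcome):
        """Pay or score a contributor based on what the group accepted."""

    @abstractmethod
    def aggregate(self):
        """Merge all current input into the single action or answer to emit."""

    def step(self, task, contributions):
        # One cycle: gather whatever input arrived, then emit a single output.
        for contributor, piece in contributions:
            self.collect(contributor, piece)
        return self.aggregate()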
This gives us a way to really think about how to design these systems in the future and how we
can design for interactivity specifically. Then I want to go back to the idea that I started with in the beginning about providing visual assistance to blind users, and the problem that we saw where people were thrashing over corrections and follow-up questions, and think about how we can solve this using the crowd agent framework. To do this I created a system called View that engages the blind user with a crowd of workers who hold an ongoing conversation, which you see on the left here, about not just a single image but a streaming video, so they can now give multiple corrections. They can stay around to see what
the result of their feedback to the user was and it makes it a lot faster to hone in on exactly
where the correct information is. So if we're looking at tasks like finding nutritional
information, ingredients, allergy information, things like that, we can actually reduce the time it
takes on average from 10 to 15 minutes with the prior single-image VizWiz system down to about 1 or 2 minutes, so you get an order of magnitude speedup here because of this interaction. Everything I've talked about so far, and really all the prior work in the field, has focused
on how do we complete tasks that people are good at? But I want to go beyond that and
actually look at cases where individuals might not necessarily be good at a given task that we
want to complete. But computers aren't up to that task either, so we need to kind of start
using the collective ability to go beyond what we can do individually. As an example of this I'll
use the problem of real-time captioning. This is a very important accommodation for deaf and hard of hearing users, but it's really very difficult when you need low latency, maybe a
few seconds per word, and that requires typing hundreds of words a minute, because people speak very, very fast, and we need the input to be accurate. So generally, if we take a single person, it's probable that they are not going to be able to do this task very well. I don't think any of us in this room
could actually keep up trying to type this talk, for instance. I'm specifically going to focus on a
classroom scenario where we have a single speaker and at least one person in the audience
who needs this accommodation. Currently, automatic speech recognition, while there have been some really amazing advances recently, is just not up to the task. The variety of different speakers, the different speaking conditions, for instance if the speaker has a cold, the different lexicons and vocabulary that we might have in any given lecture, and the different acoustic properties of the environment that we are not sure of in advance really make this a very challenging problem. Automatic speech recognition does not meet the bar under the Americans with Disabilities Act to be a viable accommodation here. Instead, the current state of the art is actually using people. It's
computer-assisted people. Professional CART captionists are individuals who have trained for
years to be able to type hundreds of words per minute within a few seconds of each word. But
because of this training and because they're so rare, it means that they are very expensive. It
costs a couple of hundred dollars an hour to hire these people and they are not easy to
schedule, because again, if you have to find somebody with this rare skill it takes 24 to 48 hours' notice, and that's kind of the standard accommodation latency that you see in a university setting, for example. This not only means that it's difficult to afford these accommodations for the people who actually have to provide them, disability service offices at universities, but it also means that we can't always access them, not just because of this lead time, but because of the cost itself. A lot of times you'll see that universities don't actually accommodate students who want to work in teams after class. In-class time is supported, but it's not a legal requirement to support work after class, and this is not something that is typically done or feasible given the budgets of these offices. What we'd like instead is for students to have the
ability to walk into a class, walk into a group setting and pull out their phone and get captions
immediately. To do this I created a system called Scribe, which actually streams audio from a
user's mobile device to a server that then breaks this task up over multiple non-expert workers
and then coordinates all of their input and aggregates it back together to provide a single caption stream within a matter of a couple of seconds. This means that we can make these captions available at any time. In fact, because we're using non-experts, they are highly available and they are also cheaper. We can pay students $10 to $15 an hour. This is not something that they have spent years training for. It's just anybody who can hear and type. Getting a group of people to do this still costs about a quarter of the price of a professional. We can now use students or other volunteers who might actually have subject matter expertise. So if we are captioning a computer science talk or a mechanical engineering talk, we can find somebody that actually knows something about that domain, rather than somebody whose specialty is just typing. But
how do we actually coordinate people to do this task? The simplest high-level version of this
would be to do a round robin. We start with one person, give them a few seconds of audio that we want them to type, and then say, don't worry about typing while somebody else takes care of the next part. We hand that to a second person and say, okay, now you have a few seconds to type, and so on and so forth, and eventually the first person catches back up, so we can integrate them back in and have them type a few more seconds of what they hear. Of course, this type of coordination requires a lot of interface support for figuring out how to coordinate people and prompting them to type at the right time.
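The round robin itself is simple to state in code; a toy sketch is below, with made-up worker names and a hypothetical five-second segment length.

def assign_segments(workers, num_segments):
    """Round-robin assignment of consecutive audio segments to captionists.

    Segment i goes to worker i mod len(workers): each person types a few
    seconds, rests while the others cover the next segments, then comes
    back up in the rotation.
    """
    return {i: workers[i % len(workers)] for i in range(num_segments)}

# Four non-expert captionists covering ten 5-second segments of a lecture.
schedule = assign_segments(["ann", "bo", "cam", "dee"], num_segments=10)
for segment, worker in schedule.items():
    print(f"segment {segment} ({segment * 5}-{segment * 5 + 5}s): {worker}")

So I want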
to show you a little bit of how Scribe does that in this video. Specifically, I want to focus on the
fact that typing will lock in, so as people type you will see it just kind of go gray. They can't go
back and edit because this takes far too long and increases the latency by too much. It also
provides a lot of cues to type and some feedback in terms of bonus points and rewards to
encourage people when they are doing a good job and this can map to money in the case that
we're actually using Mechanical Turk workers or other paid crowds to contribute to this task.
You'll also notice that the volume will increase and decrease to help more with saliency and this idea of cueing people to type at a certain time. And that's not working. This is supposed
to have audio. I'm not sure why it doesn't.
>>: The television as he knows one way generally entertainment oriented and paid for through
advertising although increasingly by subscriptions now with cable television. And finally, the
current big thing, the internet which came to us starting in 1969. The internet itself has been…
>> Walter Lasecki: That was a little bit less content than I expected, but you get the idea.
We're prompting people. We have a bit of a salient cue in the volume as well as some direct visual cues. As you see, people still make mistakes. People still make mistakes because this is still a challenging task even for those few seconds. If somebody speaks too quickly we actually
go beyond someone's working memory. To make this easier I looked at what people who do
off-line captioning actually would do, and it turns out that they slow down the audio. Off-line
captioning is the same task as real-time captioning, but without the time constraint. You give
me the captions tomorrow and that's perfectly fine. While slowing down the audio works in
that case, it's not necessarily true that it can work in real time. We would obviously fall behind.
So we use the fact that we have multiple people, not just a single captionist, to slow down what one person is hearing to, say, half speed, just for the section that they are actually typing. If they are supposed to be typing a certain segment, we play it at half speed. If they are not, we take advantage of the fact that people can listen a lot faster than they can type, and we speed it up to one and a half times speed while other people are responsible for that content. This means that we can play the audio at half speed for everyone while they are supposed to be typing, so everyone hears their own segment slowed down and you still keep the surrounding context. You can't just remove the audio in between the segments; it turns out people are almost as bad as computers when you do that. By allowing people to hear a more approachable version of the audio, we can increase the recall rate and the precision pretty significantly on these tasks.
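A sketch of that per-worker playback schedule, with the half-speed and one-and-a-half-speed rates from above and hypothetical segment numbering:

def playback_rate(segment, assigned_segments, slow=0.5, fast=1.5):
    """Per-worker playback speed in a time-warped captioning stream.

    A worker hears their own segments at half speed so typing is tractable,
    and everyone else's segments sped up, since people can listen much faster
    than they can type, which keeps the stream roughly in real time overall.
    """
    return slow if segment in assigned_segments else fast

# Worker 'ann' owns segments 0, 4 and 8 out of a 12-segment stretch.
ann_segments = set(range(0, 12, 4))
print([playback_rate(s, ann_segments) for s in range(12)])
# [0.5, 1.5, 1.5, 1.5, 0.5, 1.5, 1.5, 1.5, 0.5, 1.5, 1.5, 1.5]

What's even more amazing is that we actually see a decrease in the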
latency, so we're slowing down the audio and yet we're getting faster responses. We went back and did interviews with a lot of the workers we were engaging for this task and looked at some of the data, and the high-level reason turns out to be that, basically, when the audio is too fast for somebody to keep up with, what they'll do is listen, memorize everything and then start typing after their segment has stopped playing, basically in the little context period where somebody else is typing. That means we pay a shift penalty where we don't have the person typing at all until after we have stopped playing their audio clip. It turns out that when we slow this down it's a more tractable task and
people can type each word as it comes in more often, so we get this decrease in latency. I want
to show you quickly what that looks like. It'll be another video. It's basically the same as before
but now with this playback speed adjustment.
>>: After that with the television network which arose during my lifetime and television as you
know is one way and entertainment oriented and paid for through although increasingly by
subscriptions now with cable television. And finally, the current big thing is the internet.
>> Walter Lasecki: The current big thing is the internet. People can adapt very quickly to this. They can follow along and can complete this captioning task more easily. But it's still a pretty challenging task. People can still fall behind. They might miss a word. They might not know a certain word, and it's pretty hard to catch up. And we can't scale the same approach that we have. We certainly have a lot of people available now, because anybody who can hear and type can help, but we can't scale this division and round-robin approach to too many people, because if you imagine handing a tenth of a second of audio to a large set of people, that tenth of a second is not making it any easier to caption the content. Instead, what we're going to do is use redundant input. We still give out a few-second segment, but now we assign a few workers to each segment. Collectively this allows people to provide these captions very accurately and with high recall. We can capture 95 percent of the words that are said with 87 percent precision, and we can do this with under 3 seconds of worker time per word of latency. This is actually kind of on par with a professional in precision, and better in recall and latency.
>>: What is the difference in cost?
>> Walter Lasecki: The cost does increase. Of course, we have a lot of flexibility in the price we are paying. We were already at about a quarter of the cost. We can do this with a combination of not just workers but also ASR, which I'll mention in a little bit, but it's still much cheaper than providing this using professionals, and it's much more available and higher accuracy because of the subject matter expertise, specifically. But of course the problem is that now we have multiple streams of captions, which, as we talked about before, does not work for many applications. And it turns out that if you bring students in to actually read captions, maybe 10 seconds of captions, it takes them 45 seconds if those are parallel caption streams. They can reconstruct what was being said, but it's so slow that we would lose our real-time constraint just on the reading alone. So what we want to do is add this combiner phase where we merge all of the words that people have typed back together and create a single caption that is easy to read. We are able to do that using multiple sequence
alignment, which is a process that was originally used in computational biology to align genome sequences. While this will figure out where we have gaps and where we are missing words, and maybe even allow us to align words that we want to compare and de-noise in case somebody has a typo, for example, unfortunately all of the existing work on this has been off-line. These are dynamic programming algorithms that can give us a very good alignment, but we have to have all of the input that we want to align first, which doesn't work for our
real-time case. Instead, I came up with a graph-based approach that constructs a graph as we see words, based on which words we see immediately next to each other in different workers' inputs, and weights the edges. We can then go back with a language model and re-weight the edges, so that when we find the highest-probability path through the graph this results in a pretty reliable caption.
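A toy version of that combiner is sketched below; it uses raw adjacency counts and a greedy walk instead of the weighted, language-model-informed search the real system uses, and the example captions are invented.

from collections import defaultdict

def combine_captions(partial_captions):
    """Merge several workers' partial captions into one stream (greedy sketch).

    Builds a graph whose edges count how often one word was typed immediately
    after another across workers, then follows the heaviest edges from a start
    token. A language model could re-weight these edges before the walk.
    """
    edges = defaultdict(lambda: defaultdict(int))
    for words in partial_captions:
        prev = "<s>"
        for word in words:
            edges[prev][word] += 1
            prev = word

    merged, node, seen = [], "<s>", set()
    while edges[node]:
        node = max(edges[node], key=edges[node].get)   # most frequent next word
        if node in seen:
            break                                      # stop if we start looping
        seen.add(node)
        merged.append(node)
    return " ".join(merged)

# Three noisy partial captions of the same phrase.
print(combine_captions([
    ["the", "current", "big"],
    ["the", "current", "big", "thing"],
    ["current", "big", "thing", "is"],
]))   # "the current big thing is"

This works pretty well and holds under much more general constraints. More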
recently we have been using an A* search based approach with a beam heuristic that lets us more accurately control the computational and time resources that we use for a given alignment, and integrate the language model in a more principled way. This gives us our single
caption. And we have actually been able to show that this is very useful. We have used this in
real domains, so we captioned a number of conferences including ASSETS last year, which you
see here. This is basically our screen on the left and we are captioning for the entire
conference and the speaker's off to the right somewhere there. Interestingly, instead of hiring
Mechanical Turk workers and instead of bringing people in explicitly to provide captions for this session, we were able to go into the audience, ask for volunteers and get five students who had never used the system before. This was about 5 minutes in advance of the session, and they all sat down and were able to produce pretty good captions even with only a few minutes' notice. This really points to the idea of democratizing access technology. Now anybody who's interested, friends, peers, family members, can help with this task, whereas before it really wasn't viable to do so.
>>: I've got a question about how that works. People were in the session, so did they actually wear headphones so that they could hear the slowed parts?
>> Walter Lasecki: Right. In this case we weren't using the time warp with the in-session participants, but you can imagine doing exactly that with headphones. It does make the task a little bit more challenging to go without it. We mostly did it that way because there is still server latency in this setup, basically. We have to stream the audio to our server, which processes it and sends it back, and we didn't want to add that extra second or so of latency.
>>: Presumably because I'm attending the talk it's something that I'm interested in, but did you get a sense from the people who were doing it of how much it detracted from their experience?
>> Walter Lasecki: Yeah, that's a very good question. It is often difficult to follow the content. We were only using five people here, so it's still a reasonably intensive task. One way that we've looked at this is that here you see a little bit of loss; if we were using fewer people you would feel it even more. What we found is that if we want to use volunteers in a classroom where we actually have students, it turns out that a vast majority of students would be willing to help a peer as long as it didn't hurt their understanding of the content. So what we can usually do is find 30 students in a classroom and make it so that you only have to type a few seconds every couple of minutes, which makes it very easy to follow along and doesn't detract at all. The general approaches that we use here can actually generalize to things like coding behavioral video in a much shorter amount of time than it would usually require if you were using an undergrad, so a couple of minutes instead of a couple of weeks in this case, and even to activity recognition settings where we actually need high-speed action labels to ensure
we understand what's going on in an environment in real time. The high-level takeaway is that
instead of selecting a single person or a single answer to include in our final output, we want to
synthesize our answer from a set of different people and this allows us to go beyond what one
person is able to do and actually stitch back together something that's higher-quality. Now I
want to talk a little bit about where I want to go with this work in the future. I really started
this talk by describing how micro-tasks have allowed us to use human computation in settings that computers can't operate in alone. We can get labels for images. We can answer a lot of simple questions. It works great for batch-processed tasks or anything where we only need a single response. In this talk I've shown how we can greatly expand the space of problems we can use human computation in by looking at continuous tasks and how we can support interaction over multiple turns with an end-user. But we are still looking at the systems that
we create, these intelligent systems that we create as tools. I really want to look at in the
future how we can use mixed initiative principles to go beyond this idea of asking a question
and getting a single answer, and actually work more deeply with the crowd and get the crowd's insight into problems as we work, even when we don't know that we had a question that needed answering. One space where I'm going to explore this is smart prototyping, basically taking an initial napkin sketch and seeing how fast we can turn it into a functional prototype. In work that's appearing at CHI this year I built a system called Apparition.
You see a little video here. And it's basically a platform for exploring intelligent prototyping
tools. The user can sketch a rough version of what they want while describing it out loud and
the system will automatically convert their sketches to real elements as they go. This is basically like describing it on a whiteboard and ending up with a more realistic interface. Here we are prototyping a simple platformer game, you can think of Mario, and the user is able to describe something, sketch something, and the crowd is working behind the scenes to make this into real content, updating the grass to be green in this case. After about a minute we have a sketched prototype, but we want to add functionality. So the user creates a character and describes a basic behavior, it should follow where they click, and within 3 seconds the character actually has that behavior. You see they click and the character follows where they said
to go. They can keep using this behavior and they don't have to keep re-specifying it every time
they interact with the system. The system remembers how it's supposed to work. You can see
here they play a simple game to get across to the other side.
>>: Is that because the crowd workers have programmed the behavior behind the scenes, or is it because crowd workers are doing something like Wizard of Oz in real time?
>> Walter Lasecki: Here we're looking at basically a collective Wizard of Oz process where
people are coordinating to control different pieces of the interface. In fact, we have a lot of
new interaction techniques that make this possible, where prior synchronous drawing systems didn't actually support truly synchronous editing; things like Google Draw, for example, don't really work in this setting. There are a lot of different things we do to coordinate workers, but it is more of a free-form task than is typical. And we were able to get some more flexible systems from it as well, so it's not just workers coordinating; the system tries to do some gesture recognition for the drawn elements and will post a to-do item here on the side when it doesn't know, so that workers can take that task and help complete it. This is really cool, but it is still using the idea of the system as a tool. I want to go beyond this and look
at how we can take some of these prototypes we've created and let the crowd help us create
improved interfaces. Maybe in this case that's importing more thematically appropriate
content, so we don't actually have to go back but the crowd can help us figure out what makes
the most sense in the setting we described. And this can actually be very useful for blind
developers who might actually need otherwise help from sighted peers to them up with their
finalized version of their interface even though they have an intuition of what they want. We
can always use the fact that the crowd themselves understand why certain things happen and
start to capture that idea more formally, so maybe things like collidable surfaces, so you have
the box in the ground or something that the character can't pass through. But actually the sun
is in the background and people understand that intuitively, but the system just sees polygons.
We can also capture the idea of playable characters and maybe some of the relationships and
actions that they can carry out and even what happens when those actions are taken and they
interact with one another. And we can use this idea of using the crowd as a formalization layer
to help the system better understand what's happening in the world and to actually start to
predict what might happen in the world. And if we keep a small frontier of likely states, we can start to borrow a page from predictive parallel computation and try to guess what will happen, precompute it using people in the crowd, so go ask what might happen if I reach this state. Then when the user actually gets to that state we can have a zero-latency response. The computer already knows; it just has to apply what is about to happen.
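A sketch of that speculative precomputation loop, with stubbed-out predictors and crowd calls standing in for the real components:

def precompute_frontier(current_state, predict_next_states, ask_crowd, cache):
    """Speculatively ask the crowd about states the user might reach next.

    predict_next_states(state) guesses a small frontier of likely next states;
    ask_crowd(state) is the slow, seconds-long human computation call. If the
    user then reaches one of those states, the answer is already cached and can
    be returned with near-zero latency.
    """
    for state in predict_next_states(current_state):
        if state not in cache:
            cache[state] = ask_crowd(state)   # done ahead of time, off the critical path
    return cache

def respond(state, cache, ask_crowd):
    # Zero latency if we guessed right; fall back to a normal crowd query otherwise.
    return cache[state] if state in cache else ask_crowd(state)

cache = {}
precompute_frontier("door_closed", lambda s: ["door_open", "door_locked"],
                    lambda s: f"crowd answer for {s}", cache)
print(respond("door_open", cache, lambda s: f"crowd answer for {s}"))

If we can start looking at zero-latency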
crowds, then in the same way that going from hours to minutes and then minutes to seconds opened up whole new spaces of applications, I think zero-latency or few-millisecond responses will do exactly the same. And then, kind of behind the scenes, how can we use the understanding that both the system and the crowd have about the application domain to template out code or generate code, either from scratch or from demonstrations, or even help workers and users import generalized examples from off-line crowdsourcing platforms and community exchanges into the specific cases that we want to work, automatically? I think these
mixed initiative systems will help us go beyond just a naïve mixing of these three contributors
and actually look at how we can best take advantage of the strengths of the crowd, the creativity of the crowd and of the end user, and some of what AI can do to understand and reason about the domain. In this talk I talked about how we can support ongoing interactions with the crowd, and I really focused on the fact that consistency is the key to success. In all of the systems that I showed, we need to have consistent output that doesn't conflict with itself and that respects the progression that got us to that point in the interaction. To design systems that do this I've introduced the idea of a real-time crowd acting as a crowd agent, and an architecture for how we can create these systems. And I'm really excited about how we can support richer and richer interactions with
the crowd via things like mixed initiative principles. Thanks. [applause].
>> Kori Inkpen Quinn: Any questions?
>>: I have kind of a question and kind of a comment. I saw your crowd paths when you were driving your robot around, and in many ways it felt like [indiscernible] Pokémon. Did you find that there was a way to kind of exclude trolls from your system automatically?
>> Walter Lasecki: Yeah. One thing that drove me a little nuts about the [indiscernible]
Pokémon, while it's an awesome example, basically this is a game that was set up I guess last
year that allowed a lot of people to play a Game Boy version of Pokémon, but it was 10,000
people all playing at once and just kind of throwing input at the system. I think it had one or two modes. It was kind of mob control and a vote, although it wasn't clear exactly how they were voting. I think they were also binning by time. These are basically two of our baselines. One thing I didn't show is that mob control was also something we tried that doesn't really work, because you again lose consistency over time. What we see is that when we are actually learning the weights and using the crowd's collective voice to select workers, we are able to very easily identify people who are trying to differ from the rest of the set. We can't guarantee that those people aren't malicious, but since things like Pokémon or this navigation task are relatively straightforward in terms of making progress, people who are not contributing in that direction are pretty easy to detect. So yeah, we can do that. Yeah?
>>: Can you define again what it was you meant by [indiscernible]? I wasn't totally clear on why the earlier systems you talked about didn't count as mixed initiative. It seemed like the one where the blind people took video while the chorus of workers were dialing in was similar to the [indiscernible] example.
>> Walter Lasecki: Yeah. That is actually kind of a nice example of a mixed initiative
interaction, but it's not something we necessarily designed for. It was kind of a byproduct of
the way we designed the incentive mechanism and because we had this rolling process where
we didn't just restrict people to kind of a single contribution, we were able to get this insight. It
was the same thing with the Chorus example where we come back and actually suggest the city
pass even when the user didn't ask for it. So I really think it's about focusing on trying to design for this, rather than just taking advantage of what naturally occurs. In those systems it wasn't the initial intention.
>>: And do you think in systems like that is it important that the end user believes that they are
actually interacting with a single actor or with an intelligent computer system versus that they
know that they are interacting with a crowd of workers? Is it important to hide that fact from
them or is it not?
>> Walter Lasecki: We were mostly focusing on hiding this fact in the context of trying to make
sure that we didn't have multiple threads of interaction. It's confusing to actually hold multiple
conversations at the same time, so from that standpoint we don't want to feel like we're talking
to a set of people. But at the same time once we have some of this filtering, it turned out that
it was actually useful to end-users to include some information about how confident the system is. So we can tell people, yes, there are multiple people behind the scenes and this is the level of agreement on a given answer. And people find that informative, so it's a combination. Yes?
>>: So crowdsourcing research has exploded in the past several years. What do you see going forward? Do you see the types of fads and activities that people are applying to crowdsourcing staying somewhat consistent, or do you anticipate that work will [indiscernible] or deviate off in a different direction?
>> Walter Lasecki: I think we've only seen a fraction of the actual applications that we could
create. Even this idea of moving towards more mixed initiative systems and collaborating with
our intelligent systems is something that really hasn't been done yet even though we keep
claiming that we have an intelligent system. It is very much designed from the tool perspective,
so I think there will be a lot of brand new applications, not just kind of solidification of prior
work. I also think that hopefully the platforms will become better and make it a little bit easier
to work in this space.
>>: What do the platforms need in order to become better?
>> Walter Lasecki: Right now, for example, Mechanical Turk really doesn't have real-time support. We're kind of hacking the system a little bit to pre-recruit our own crowd that has these properties, which is very hard to scale and also hard to do as a one-off; there's a middle level where we need some workers but don't need too many. I think if the platform natively supported this kind of task routing to individuals who are available at a given time, or at least allowed people to be kind of on call, we would actually see more people working in this space. Yeah?
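For illustration, here is a minimal sketch of the retainer-style, on-call support described above, assuming a hypothetical pool object rather than any real Mechanical Turk API; pre-recruited workers wait in a queue and a task is routed to whoever is available within a couple of seconds, falling back to a normal posting otherwise:

```python
# Minimal sketch of a retainer pool: workers are pre-recruited and held on
# call so tasks can be routed to them in seconds rather than minutes.
import queue
import time

class RetainerPool:
    def __init__(self, target_size):
        self.target_size = target_size     # how many workers to keep on call
        self.idle_workers = queue.Queue()  # workers currently waiting

    def worker_arrived(self, worker_id):
        """Called when a pre-recruited worker checks in and waits on call."""
        self.idle_workers.put(worker_id)

    def route_task(self, task, timeout=2.0):
        """Send a task to the next available on-call worker, if any."""
        try:
            worker_id = self.idle_workers.get(timeout=timeout)
        except queue.Empty:
            return None  # no one on call: fall back to posting a normal task
        return dispatch(task, worker_id)

def dispatch(task, worker_id):
    # Placeholder: a real system would push the task to the worker's open
    # browser session (e.g. over a long-poll or websocket connection).
    return {"task": task, "worker": worker_id, "dispatched_at": time.time()}
```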
>>: A lot of this work essentially embeds the crowd in what are, from the user's perspective, pretty intimate settings. For blind people it might be their bedrooms, or in the transcription case the content might be something that you don't necessarily want to spread to the entire crowd. Have you thought about privacy?
>> Walter Lasecki: Yeah. We've actually done some of our first work in this space, with Jamie and AJ, where we were looking at how you might attack these systems, how malicious workers in these systems could actually be a threat to the end user. And while we're not seeing a high level of malicious users right now, people who would actually do bad things with the information, certainly that will grow over time. It's also true that a single individual is not necessarily the biggest threat, because a lot of the de-noising properties can at least prevent bad answers. So in the blind user case, where we actually want to make sure that we don't misinform the end user, bad input often gets filtered out with the same approaches we are using here. If we want to prevent workers from seeing the content in the first place, we are now looking at how we can create intelligent filters that use people but never reveal that information, which I can talk a little bit more about offline if you're curious.
>>: You mentioned zero latency crowd. How is that possible?
>>: Don't take my question. [laughter].
>> Walter Lasecki: The idea here would be that as we start to formalize more and more of the information about the domain, which in this case we just created, the system can start to run a planning process over what might happen. Where could we go from the point we've reached? Where do we expect to go based on prior interactions? And once we can start to narrow down the size of that guess set, we can start to precompute, maybe three seconds ahead if that's our current latency, what might happen, get that answer, and then have it ready to fire immediately when we reach that state. So we are talking about a few milliseconds, whatever it takes a computer to respond, rather than whatever it would take to be surprised by that new state and then go out and get an answer for that exact moment.
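As a rough sketch of this speculation loop (not code from the talk; predict_next_states and ask_crowd are hypothetical stand-ins for the planner and the crowd query), the system fires crowd requests for the predicted guess set ahead of time and serves the cached answer the moment a predicted state actually occurs:

```python
# Minimal sketch of speculative precomputation: ask the crowd about likely
# next states before they happen, then answer from the cache when they do.
from concurrent.futures import ThreadPoolExecutor
import time

def predict_next_states(current_state, history):
    # Hypothetical planner: returns a small guess set of likely next states.
    return []

def ask_crowd(state):
    # Hypothetical crowd query; stands in for a ~3 second round trip.
    time.sleep(3.0)
    return f"crowd answer for {state}"

pool = ThreadPoolExecutor(max_workers=8)
pending = {}  # predicted state -> future holding the crowd's answer

def speculate(current_state, history):
    """Fire crowd queries for the guess set now, before the state is reached."""
    for state in predict_next_states(current_state, history):
        if state not in pending:
            pending[state] = pool.submit(ask_crowd, state)

def respond(new_state):
    """Serve the precomputed answer almost instantly when the guess was right."""
    future = pending.pop(new_state, None)
    if future is not None:
        return future.result()   # milliseconds if the crowd already answered
    return ask_crowd(new_state)  # surprised: pay the full crowd latency
```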
>>: There was another lady wondering at some point about where these things are going. One of the things that you talked about a lot [indiscernible] systems is latency and trying to get that down towards zero. What I'm also seeing in a number of the different things you have done is that each still feels like a specialized solution: deal with this problem, this particular way of working, find the right information. Since each of these problems is different, do you envision that we will have more general solutions for reducing latency, or is it going towards just building a tool set of like 58 different kinds of things? Are there generalizable solutions? Or is it kind of a new problem [indiscernible] exploration phase each time? What's your sense?
>> Walter Lasecki: Yeah. What I've focused on here, in the broadest sense, is the type of input and output, going back to this crowd agent architecture: what types of input do we have to deal with, a stream that's faster than individual people can contribute to, that kind of thing. So I think there are those broad classes, and if we see the same problem, or the same kind of properties in a problem, we can reuse the approach. There are also going to be more fine grained optimizations, like time warp, where we look at how people complete a specific task and at some of the human factors of doing that task, and then augment the general approach with that. Scribe is a good example in that the reasonably general approach carries over: we've done very different things, like activity recognition, using a very similar process, but there we didn't use things like time warp.
>>: We are going to take it in a different direction. Mechanical Turk is about paid crowd workers, so it's about getting a crowd to work and help me complete a task. The robot driving is a little different, the same as the Pokémon game: that's more crowd participation, so it's not that I'm doing a task for you, it's that I want to play this game with a crowd. Has there been much of that type of activity, where it's not that I'm working for somebody, but that I'm doing this for my own enjoyment with other people?
>> Walter Lasecki: Not so much within the computation space. You do see other things in crowd sourcing that maybe are; you can even think of things like crowd sourcing for community events as a type of this, where we are all working towards a common goal, but there's not necessarily a computational process involved. I think it is interesting that the joint game playing we set up in Legion is self-directed; that's not necessarily true of the robot driving. There we were actually specifying the goal, so we were looking for a particular process. But the gaming is a very good example using the same system: there we were not hiring people but letting people, as you saw in the picture, sit on a couch and collaboratively play the game. I think we can transplant some of those ideas from the computational space to the more general self-directed crowd sourcing. I also think this is different from Apparition, where we are trying to use a computational process one level of abstraction up: we know people are completing something, computing, using [indiscernible] just with their mental process, to get to the response that we need, and that's integrated in a certain way. The self-directed case is much more undirected, and I think what we can learn from teamwork settings informs how we do the coordination in things like Apparition, where workers don't have an explicit "I know I need to do x"; it's kind of up to the workers to figure out what they need to do.
>> Kori Inkpen Quinn: Any last comments?
>>: That just made me think: do you have thoughts from the workers' perspective as to what it's like to do these small tasks or participate in these things? Because when you think about playing the game, that is sort of us doing it rather than it being an external worker.
>> Walter Lasecki: It varies a lot by type of task. What we see in a lot of the assistive technology work is that people are actually more engaged than with micro-tasks, and they feel like they have contributed a bigger piece when we use continuous tasks. But in general, knowing that they are doing good is something that helps. I also think that from the worker side, this line of work keeps their interaction with the system continuous, and that avoids a lot of the interruption problems that we see in traditional micro-task crowd sourcing. There is actually a paper that looks at some of the detrimental effects on workers of having their workflows constantly interrupted in ways that are pretty common when we use micro-tasks, and the punch line is that it can take people up to twice as long, which of course they are not paid differently for, so just a little bit of wrong routing can essentially cut their effective pay rate in half. So this is trying to get away from some of those problems.
>> Kori Inkpen Quinn: Okay. Thank you. [applause].