>>: Each year Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors and leading academics, and makes videos of these lectures freely available. [music]
>> Kori Inkpen Quinn: Okay. I think we can get started. We're very excited to have Walter
Lasecki interviewing with us today from the University of Rochester. Walter is a student of
Jeffrey Bigham and James Allen. He's no stranger to many of us here. He has done two
internships and was an MSR Fellowship winner. Walter also won a best paper award at WISS 2014, and his work, which you will hear more about in a minute, is related to continuous real-time crowdsourcing.
>> Walter Lasecki: Thanks for the introduction. Thanks for having me out. Today I'm going to
be talking about how we can use human intelligence to augment automated systems that go beyond the boundaries of AI and solve problems that neither people nor machines can solve alone. The resulting intelligent systems can be used to solve important problems today, such as providing better access to technology for people with disabilities. But they can also scaffold future intelligent systems as we think about how to build and train them. Why do
we want intelligent systems in the first place? We can do things like have more contextualized
and natural interactions with computers. We can offload some of the low level tedious tasks
that we have to do and focus more on our high-level goals. We can even convert one form of
media to another such as converting speech to text in real-time with closed captioning. While
these are really interesting and very general impacts that these systems can have, I'm particularly interested in how we can use these technologies to actually help improve accessibility, whether that's accessibility for individuals with low literacy or low technical literacy who might need a more natural way to interact with information than the typical interfaces that we use, or helping people with disabilities, such as people with motor or visual impairments who need to control an interface that isn't necessarily as easy for them as it would be
for us, or even providing captions for deaf and hard of hearing individuals. The point of all of
this is that this requires solving AI-hard problems. These are problems like understanding
intention in natural language or understanding the content of visual scenes. We're just not
there with automatic systems just yet. We need human level intelligence. But fortunately, in
recent years platforms like Mechanical Turk have come out and allowed us to actually access
human intelligence and insight very, very quickly with an API call. Now you can start to think about algorithms that integrate this human intelligence as part of their function. We call this human computation. I said you can do this quickly, but originally it took hours or days to get responses, and only in more recent work by folks at Rochester and MIT has it been shown that you can get this down to just a couple of seconds to get people to show up at a task. In fact, last year I released some of the first publicly available tools that allow anybody to go in and recruit small subsets of the crowd in advance, so we can have people on demand exactly when we need them.
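To make that concrete, the kind of pre-recruited retainer pool those tools manage can be sketched in a few lines of Python; the class and method names below are invented for illustration and are not the actual released tools.

import queue

class RetainerPool:
    """Keep a small pool of pre-recruited workers waiting so tasks can start in seconds."""

    def __init__(self, target_size):
        self.target_size = target_size
        self.waiting = queue.Queue()   # workers currently on retainer

    def worker_arrived(self, worker_id):
        # Called when a recruited worker accepts the retainer task and starts waiting.
        self.waiting.put(worker_id)

    def needs_recruiting(self):
        # Post more recruiting tasks whenever the pool dips below its target size.
        return self.waiting.qsize() < self.target_size

    def dispatch(self, task, timeout=2.0):
        # Hand the real task to a waiting worker; returns None if nobody is on call.
        try:
            return (self.waiting.get(timeout=timeout), task)
        except queue.Empty:
            return None

pool = RetainerPool(target_size=5)
pool.worker_arrived("w1")
print(pool.dispatch({"kind": "answer_question", "payload": "photo.jpg"}))

We can get people to a task quickly, but it's very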
different to manage people in a computational process compared to hardware and the more traditional forms of computation. That's for a number of reasons. First, a lot of people
really use these platforms because of the flexibility that they allow in the way that people work.
You can arrive at any time as a worker. You can pick tasks that are really the ones that you're
interested in and nobody's saying you have to stay around. You can leave whenever you want,
which means we may not know if a person is going to complete a task or come back for the
next task that we would think of as being a part of the series. Even once we get a response, it's
not necessarily true that we know what the skills of that individual are, how much effort they
are putting in, or whether they are even trying to game the system. And sometimes people just make mistakes, or have a setup that leads them to see the problem differently than we expected. This means that we don't know if we're going to get our tasks
completed. We don't know what order they are going to be completed in and we really don't
know what the quality of the answer is even if we get it. So the field is really focused on how
we can solve these problems by taking a task, breaking it down into small, context-free units of work that only take a few seconds or a few minutes to actually complete, called micro-tasks, and then using that divisibility and that context-free nature to let workers on these
platforms take the task in whatever order they feel is best. This works pretty well. But then we
realize once we get these answers back we don't have a good way to confirm that they are
actually correct. In most cases it's actually just as hard to confirm an answer is correct as it is to
solve a problem in the first place. So instead we are going to use the fact that these are small
units of work that are designed to be easily comparable. What that means is that we can collect a set of responses to each of these questions and use a simple aggregation scheme, such as voting, to get a more reliable answer that we can really have faith in.
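As a rough sketch of the kind of voting being described, the aggregation step can be written in a few lines of Python; the labels and agreement threshold here are purely illustrative.

from collections import Counter

def aggregate_by_vote(responses, min_agreement=0.5):
    """Pick the most common answer to a micro-task, if enough workers agree on it.

    responses: answers from different workers to the same question.
    Returns (answer, confidence), with answer set to None when agreement is too low.
    """
    counts = Counter(responses)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(responses)
    return (answer if confidence >= min_agreement else None, confidence)

# Five workers label the same image.
print(aggregate_by_vote(["cat", "cat", "dog", "cat", "cat"]))   # ('cat', 0.8)

This basic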
idea has been used in a wide range of systems that we have seen so far. There are things like
coming up with arbitrary image labels for images on the web, answers to visual questions for
blind users, even finding just the right moment to take a picture with a smart camera. It's also
been used for things that are nonvisual, for example NLP and linguistics tasks, finding
linguistic annotations for text data sets, editing documents or even translating the documents
to another language. This is really just a small sampling of the space here. There is a lot of
work in the HCI and AI communities on these types of systems and really we have seen in the
past couple of years an entire conference sponsored by AAAI, HCOMP, and even a human computation journal arise to support this community. But while micro-tasks are extremely powerful, they fundamentally limit the types of tasks that we can think about for crowdsourcing and using human computation as part of the process, because not everything can be broken down into these little pieces of work. I'm going to make the claim that we can actually create real-time systems that support multi-turn interactions. Even though we have a very dynamic crowd,
we can maintain context and maintain interaction with these systems. I'm a systems builder, so
I have built a number of large web-based systems for coordinating workers in real time around
a single task over the last four years. And for each of the systems, we kind of look at a way that
we can solve a very generalizable problem and a whole space of problems while also solving a
specific instance of a challenge that we had not known we could use crowdsourcing to
solve before. And then for each one of the systems I built I've looked at what are some of the
other problems in that space, the actual application space, what are some generalizable
properties about the crowd we can use to inform future systems, and even, how do we get
around some of the hurdles that we see in deploying these systems in real domains? Since I
can't talk about all of this in one talk, I'm really going to focus on explaining how maintaining
context is critical to a wide range of tasks. Then I'm going to introduce a new type of task,
actually not a micro-task but a continuous task that keeps people engaged for longer periods of
time and goes beyond a lot of the assumptions that we made for micro-tasks. And then I'm
going to talk about how this allows us to start thinking about going beyond the traditional
impetus for crowdsourcing, which is that we have a task that people are good at but computers
really aren't, and even looking at tasks that individual people are not necessarily good at. I'll
talk about some of where I want to go with this work in the future. Let's start with maintaining
context, and I'm going to use the example of VizWiz, which is a system that was originally introduced in 2010 that allows blind users to take a picture, speak a question and get an answer within about a minute, which at the time was state-of-the-art. I helped deploy this system in 2011, and the resulting data set of almost 20,000 questions asked by thousands of blind users is really fascinating for a number of reasons. But in this data I found a set of cases where
it really shows that micro-tasks are not the end-all be-all of human computation. And I'm just
going to give this one example here where a user asks, what are the cooking instructions on this
TV dinner? It's actually a pretty popular class of questions, and understandably, the
information needed to answer this question is very hard to frame for a blind user. There's no
tactile feedback on the actual package of which they are taking a picture, so they don't know
that they didn't capture the cooking instructions. When the system goes out and recruits a worker or two to answer this question, they say, I can't see the instructions. But then they
usually will try to give a little bit of feedback on how this might be corrected. They will say why
don't you flip the box over? So the blind user flips the box over, takes a picture again and asks,
how about now? But the system doesn't necessarily recruit the same workers and it's not sure
that that person is always going to be available. Instead, it goes back to the crowd, it posts
another micro-task, gets a different worker who then doesn't know the original question. So
they say something very general like it's a box of food. So the user goes in, fixes this, realizes
what the problem was. They re-ask the question, take a picture again. The system goes and gets another worker, and they say, well, I can't see the answer, so why don't you flip the box over? These kinds of cyclic interactions can actually take a very, very long time. In fact, even though we get answers to every individual question back in about a minute, it turns out that this typically takes about 10 to 15 minutes before a user can actually get an answer to their question, or they just abandon the task altogether. The core problem here is that the end-user is trying
to have a conversation with the system that maintains context when the system doesn't
actually support that type of interaction. I'm going to focus in on this problem and look at how
we can solve the challenge of actually holding a conversation that takes multiple turns and maybe even spans multiple sessions. We go away and we come back tomorrow and want
to continue that conversation from where we were. To explore this I created a system called
Chorus, which is a virtual personal assistant powered by many crowd workers
behind the scenes. To the end-user this actually just appears as a chat conversation with a
single individual. They have a pretty standard instant messenger window and they are able to
interact with the system. Behind the scenes, workers also participate in a process of curating a sort of working memory. These are important facts. It's kind of a predictive task where
workers try to speculate on what would be important to future workers who are completing
this task. That might include capturing things like allergy information or location information
about things that they shared in the past. To actually hold a conversation it's a little bit more
complicated than just posting a chat message. Instead, we actually have a collective process
where workers are all proposing and voting on one another's answers. We essentially have a rolling voting scheme. To incentivize people to give us the right answers, the best answers that they can, we have a mechanism that asks people to propose answers, and if they propose an answer that's correct we'll give them 3000 points. They don't get any points if other people don't agree that they have given a reasonable answer for the current task. Since it's a little bit easier to vote for one of these answers than it is to go out and
find information and bring it back, if a worker votes on an answer then they will get 1000
points, again if that is marked as correct. To make sure that people don't just randomly guess
at what the answer is if they don't know, we also add in this kind of no-op bonus. It gives
workers a little bit of a reward for not doing anything because it's pretty easy to figure out
when people are abusing it. One of the nice things here about this incentive mechanism is that
behind the scenes we are actually able to tune a single parameter to figure out the spacing
between these rewards. And what that will effectively let us do is dial in the verbosity of the
crowd, so how many different responses do we want? Do we want a large set of maybe
creative responses, or do we want a pretty narrow set of really reliable responses? And while there are some issues, if you look at the actual game-theoretic properties of this mechanism you will find there are multiple equilibria, empirically this does work to encourage people to propose more or fewer answers.
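To make the shape of this incentive mechanism concrete, here is a minimal sketch in Python; the exact point value of the no-op bonus and the acceptance logic are illustrative rather than the real Chorus implementation.

def score_contribution(kind, accepted, propose_reward=3000, vote_reward=1000, noop_bonus=100):
    """Assign points for one worker contribution in a Chorus-style reward scheme.

    kind: 'propose', 'vote' or 'noop'. A proposal or vote only pays off if the
    answer it backs is eventually accepted by the rest of the crowd. The spacing
    between propose_reward and vote_reward is the single knob that tunes the
    verbosity of the crowd: how eager workers are to propose versus just vote.
    """
    if kind == "propose":
        return propose_reward if accepted else 0
    if kind == "vote":
        return vote_reward if accepted else 0
    if kind == "noop":
        return noop_bonus   # small reward for honestly sitting out instead of guessing
    return 0

print(score_contribution("propose", accepted=True))   # 3000
print(score_contribution("vote", accepted=True))      # 1000

What does this all mean in the end? It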
basically means that we can filter down the set of responses to just the ones highlighted in blue
here. We really remove things like this line where somebody comes in right in the middle and
says well how is everybody doing? That clearly is not what we wanted right in the middle of the
conversation, so other workers catch on to this and they filter out that response. The end-user only sees a consistent conversation that is accurate and is sensitive to the right points in the conversation, to what we actually said in the
past and how that fits together. We brought a dozen users in to our lab to actually study how
the system works. We gave them a high-level objective of what they wanted to accomplish. We didn't give them a specific script, as that would kind of defeat the point. We ran this as a within-subjects design study. The results were really interesting, and I'm just going to give one example here to give a flavor of how some of this turned out. The user in this case comes in and asks about some activities that they want to do in Houston. They basically ask what the price of doing this set of activities is, and after a little bit of clarification the crowd is able to come back and tell them that it will cost $150. Interestingly, while that's the answer to the question the user actually asked, a moment later the crowd comes back and says, actually, there is a
better answer. There is a city pass available from the local tourism office and that allows you to
do everything and actually a little bit more than what you are expecting and it's cheaper. So
this is an answer that the end-user did not actually expect to need, but it was
kind of serendipitously found by the system by this collective search process behind the scenes
that can discover information even if the end-user didn't ask about it explicitly. And because of
this serendipitous information finding, we actually find a majority of our participants preferred
Chorus to a keyword search like Bing. More quantitatively, we can also show that the system is
able to stay on topic, is able to give reliable answers that are very accurate and answer what
people are actually asking about. And interestingly, the crowd is also able to recall 80 percent of the facts from a prior session. Even though no workers were present when the user originally sent a piece of information, they were able to recall it using this working memory window 80 percent of the time, so we didn't have to re-ask. We didn't have to rehash the same ground. It is possible to support this type of multi-turn interaction even though we have a very
dynamic crowd behind the scenes that is constantly changing. And if we look at the roles that
we actually have people completing in the background, given a question, some people are
going to propose a new answer and others are going to help filter those answers down so
that we only forward the correct answers to the end-user. It's easy to see also how we can use
this same structure to include things like dialogue systems. It doesn't matter that this isn't a
person now. If it produces the right answer, then humans will go in and filter this and make
sure that we are only passing good answers to the end-user. But if it produces an incorrect answer, in this case the wrong city, one that sounds like the right city and has the same name but is a city instead of a region, then workers can catch this, find out it's not the right information, propose a new answer and then filter that so that we know the human answer is correct. Not only do we prevent the system from ever giving the end-user an
incorrect response and we prevent the system from ever corrupting the end-user's experience,
we also get the advantage of actually having this as training data. So where we would have
normally have derailed the conversation and gone down a conversational path that doesn't really fit what the end-user wanted to accomplish, we can now stay on track and see what would have happened even after the system makes a mistake. And we still know the system made a mistake because its answer was not voted on and forwarded. This gives us kind of a new way to
think about how we deploy and then train AI systems in the wild. Now I want to talk about a
new kind of task, a continuous task that goes beyond a lot of the assumptions of micro-tasks
and lets us solve a whole new space of problems. This is an example of something that I think
we can't solve using micro-tasks alone. I'm going to use the example of driving a robot. This is
very important for home-assistance robotics where you might have somebody with a mobility
impairment who needs a little bit of help actually accessing things in their home. But how
would you control this with micro-tasks? It's a naturally continuous space. We have to be able
to sense the environment. We have to be able to listen to the user and decide what we actually
want to do about that. We want to provide continuous control to the robot which does not
expect discrete input and we have to actually be able to respond to events in the environment
as we encounter them. As a proxy for this home-assistance robotics task I'm actually going to use
this little off-the-shelf Rovio robot where we have a little bit of an obstacle in the way and we
have to drive around the obstacle to get to a target. Instead of breaking this down into little
pieces, we're actually going to give workers control of an actual streaming video interface that
is from the front mounted camera on the robot and we're going to let them use the default
controls that the system comes with which are basically just arrow keys and a few other simple
controls. If we do this and we give control to a single person, what we find is something like
this, where the majority of the time we actually fail to complete the task, because people will join, give it a good first shot at solving the problem, and then, as in this case, fail to complete it. They crash into a wall and then they abandon the task. This is a top-down view
of our set up in our lab with a trace of where this particular instance of the navigation task
went. This is because there is going to be some noise in the system. Not everybody is going
to want to stay around and complete this entire task and if it's not easy to do they might just
have a higher rate of abandonment. If we use the fact that we have lots of crowd workers available and just pick a new one from the crowd, letting somebody else take control whenever a worker abandons like this, we get something like this. We find out there are trolls on the internet, basically. Somebody here navigates the robot almost to the target, but then it's back and forth and back and forth and back and forth, and they abandon the task somewhere there a few minutes later, and finally a different worker connects and finishes the task. This of course
was something that raised our completion rate in this case because eventually somebody will
finish most of the time, but if we could drive this robot off a cliff it would be pretty problematic,
right? It works here because you can't destroy the task. What we really want to do to make
sure that we don't have this kind of malicious or otherwise noisy input is to aggregate people's responses. We used simple voting in the case where we broke something down into micro-tasks, and we want to use something similar here, but a lot of the standard approaches for dealing with continuous input don't necessarily work. I use the example
that if the robot drove up to a tree it can go left. It can go right. There is a perfectly good set of
options, but we don't want an average. Instead, we're going to take a vote, but that requires
actually discretizing time, so this is going to be a much slower approach where we need to wait
for one second of input, look at the most popular answer and then use that to control the
robot. The problem is still that we are not taking a consistent stream of actions that any one worker would have selected, so we have a lot of conflict early on here, where when we are
trying to decide how to get around this barrier, do we want to go forward first? Do we want to
go to the left first? People ended up crashing a lot in this case. That doesn't quite work. To
solve this problem, we're actually going to borrow a page from representative democracy and
we're going to pick one leader who is in control at any one time but then not make that
necessarily the same person over multiple periods in time and I'll show you how this works
briefly. If we wanted to really solve this problem, we would have to solve this POMDP where
we really want to get agreement between different workers' policies, but of course, we can't see what people would do at every point in time. And there are different representations in people's heads, so we don't even know exactly how the states line up with the real world. This isn't very tractable. Instead we're going to try to
approximate this by looking at people's input over time, comparing it to the rest of the group, and then picking whoever is most representative of the group to be in control. Given that we have some input, and some funny properties of that input, for instance, we can't assume that two up commands are the same just because they look the same, since going forward a moment later might be a very different action than going forward at the first moment, we're going to bin this over a one-second time period, look at each worker's input for that time period and use a similarity function, cosine similarity here, to compare it to the rest of the crowd, so we get some score as to how similar they are to the rest of the individuals who are contributing. Then we'll use that score to update a weight, and we do this for every time point and for each worker in the crowd. This means that at any given time point we can just pick the highest-weight worker and give them direct control as the kind of representative leader for that period. As we keep updating these weights, we'll get essentially a new person in control at any given one-second span, but we won't be interleaving people's input on a finer-grained scale.
Legion was able to not just control a robot, but actually an arbitrary user interface, by letting you select a portion of your desktop, give a natural language command for what you want completed, and then it would go about completing that task with the crowd.
We have used this to control anything from office software to assisted keyboards for people
with motor impairments to even letting multiple people play a single player videogame
collectively. The idea of being able to keep people involved in a task for a longer period of time also points to a new way to maintain context. If we had one person here
for the entire task, then they could remember what was happening, the same thing for the
conversation. To set up a little bit of a test for that I actually used a videogame set up. We
created a custom map here and basically the idea was we had people using Legion to control a
videogame character. They were walking in this cyclic map and it took about 30 or 40 seconds
to get from one decision point to the next, and the decision they made was essentially to push the white button or the black button depending on some prior cue that they saw. We actually ran this for a full hour, and it turns out we can complete this task even though people were not present for more than a couple of minutes at a time. We could complete this task consistently and reliably the entire time. Here you see a graph of people's
input. You can see workers on the y-axis there and each red marking is basically a point at
which the worker contributed some input to the system. And we see that maybe a worker stays for a few minutes and then they leave, and somebody else joins for a few minutes and then they leave. But the way we were able to control this for a longer period of time is that if you look at when people actually join a task versus when they start to contribute, we see that there is basically this synchronization period where they are watching what other people are doing to learn how to complete the task, and then they use that to continue on in the same way. This is kind of reminiscent of organizational memory from the behavioral literature, where this is essentially the process that organizations and societies use to pass traditions down and other
types of behaviors. This really gives another interesting way to capture this idea of context. I
talked so far about two very different ways to accomplish this goal of maintaining context over time. I want to zoom out for a little bit and talk about the more general framework
that this fits into. Specifically, if we wanted to create these systems, what is the architecture?
What does it look like? We are always going to need a way to divide these tasks either into
small pieces or into actual roles that people synchronously coordinate and complete. We're
also going to need a way to collect input from workers, an interface that lets us get input from
contributors and contributors could be human workers or they could be automated systems
that know how to do all or a piece of this task. In either case we will need reward schemes to
make sure that systems are incentivized correctly or workers are paid fairly for their efforts in
the system. Then we're going to need to think about the communication channels between
people and how we actually think about maintaining context without allowing collusion,
without allowing people to really coordinate to the detriment of the incentive mechanism that
we are using. And then we are always going to need to actually aggregate these responses.
Whether we're thinking about controlling an interface or about holding a conversation, we
don't want parallel threads in existence. We really want a single control stream. We really
want a single conversation. This idea of having a single input and a single output is the same
idea as having a collective that acts as a single individual, which we'll call a crowd agent. This
idea kind of harks back to the agent architectures that we see in the AI literature in prior work.
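One way to read that crowd agent architecture as code is the skeleton below; the class and method names are invented for illustration, and real systems fill in very different implementations of each piece.

from abc import ABC, abstractmethod

class CrowdAgent(ABC):
    """Skeleton of the crowd agent pattern: many contributors in, one behavior out.

    Contributors can be human workers or automated systems. The agent divides
    the task, collects and rewards input, and aggregates everything into a
    single output stream, so the collective acts like a single individual.
    """

    @abstractmethod
    def divide(self, task):
        """Split the task into micro-tasks or synchronous roles."""

    @abstractmethod
    def collect(self, contributor, piece):
        """Record one contributor's input for one piece of the task."""

    @abstractmethod
    def reward(self, contributor, outcome):
        """Pay or score a contributor based on what the group accepted."""

    @abstractmethod
    def aggregate(self):
        """Merge all current input into the single action or answer to emit."""

    def step(self, task, contributions):
        # One cycle: gather whatever input arrived, then emit a single output.
        for contributor, piece in contributions:
            self.collect(contributor, piece)
        return self.aggregate()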
This gives us a way to really think about how to design these systems in the future and how we
can design for interactivity specifically. Then I want to go back to the idea that I started with in the beginning about providing visual assistance to blind users, and the problem that we saw where people were thrashing over corrections and follow-up questions, and think about how we can solve this using the crowd agent framework. To do this I created a system called View that engages the blind user with a crowd of workers who hold an ongoing conversation, which you see on the left here, about not just a single image but a streaming video, so they can now give multiple corrections. They can stay around to see what
the result of their feedback to the user was and it makes it a lot faster to hone in on exactly
where the correct information is. So if we're looking at tasks like finding nutritional
information, ingredients, allergy information, things like that, we can actually reduce the time it
takes on average from 10 to 15 minutes with the prior single-image VizWiz system down to about 1 or 2 minutes, so you get an order of magnitude speedup here because of this interaction. Everything I've talked about so far, and really all the prior work in the field, has focused
on how do we complete tasks that people are good at? But I want to go beyond that and
actually look at cases where individuals might not necessarily be good at a given task that we
want to complete. But computers aren't up to that task either, so we need to kind of start
using the collective ability to go beyond what we can do individually. As an example of this I'll
use the problem of real-time captioning. This is a very important accommodation for deaf and hard of hearing users, but it's really very difficult when you need low latency, maybe a
few seconds per word, and that requires typing hundreds of words a minute, because people speak very, very fast, and we need the input to be accurate. So generally, if we take a single person, it's probable that they are not going to be able to do this task very well. I don't think any of us in this room
could actually keep up trying to type this talk, for instance. I'm specifically going to focus on a
classroom scenario where we have a single speaker and at least one person in the audience
who needs this accommodation. Currently, automatic speech recognition, while there have been some really amazing advances recently, is just not up to the task. The variety of different speakers, the different speaking conditions, for instance if the speaker has a cold, the different lexicons and vocabulary that we might have in any given lecture, and the different acoustic properties of the environment that we are not sure of in advance really make this a very challenging problem. Automatic speech recognition does not meet the bar under the Americans with Disabilities Act to be a viable accommodation here. Instead, the current state of the art is actually using people. It's
computer-assisted people. Professional CART captionists are individuals who have trained for
years to be able to type hundreds of words per minute within a few seconds of each word. But
because of this training and because they're so rare, it means that they are very expensive. It
costs a couple of hundred dollars an hour to hire these people and they are not easy to
schedule, because again, if you have to find somebody with this rare skill it takes 24 to 48 hours' notice, and that's kind of the standard accommodation latency that you see in a university setting, for example. This not only means that it's difficult to afford these accommodations for the people who actually have to provide them, disability service offices at universities, but it also means that we can't always access them, not just because of this lead time, but because of the cost itself. A lot of times you'll see that universities don't actually accommodate students who want to work in teams after class. In-class time is supported, but it's not a legal requirement to support work after class, and this is not something that is typically done or feasible given the budgets of these offices. What we'd like instead is for students to have the
ability to walk into a class, walk into a group setting and pull out their phone and get captions
immediately. To do this I created a system called Scribe, which actually streams audio from a
user's mobile device to a server that then breaks this task up over multiple non-expert workers
and then coordinates all of their input and aggregates it back together to provide a single caption stream within a matter of a couple of seconds. This means that we can make these captions available at any time. In fact, because we're using non-experts, they are highly available and they are also cheaper. We can pay students $10 to $15 an hour. This is not something that they have spent years training for. It's just anybody who can hear and type. Getting a group of people to do this still costs about a quarter of the price of a professional. We can now use students or other volunteers who might actually have subject matter expertise. So if we are captioning a computer science talk or a mechanical engineering talk, we can find somebody that actually knows something about that domain, rather than somebody whose specialty is just typing. But
how do we actually coordinate people to do this task? The simplest high-level version of this
would be to do a round robin. We start with one person, give them a few seconds of audio that we want them to type, and then say, don't worry about typing while somebody else takes care of the next part. We hand that to a second person and say, okay, now you have a few seconds to type, and so on and so forth, and eventually the first person catches back up, so we can integrate them back in and have them type a few more seconds of what they hear. Of course, this type of coordination requires a lot of interface support for figuring out how to coordinate people and prompting them to type at the right time.
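The round robin itself is simple to state in code; a toy sketch is below, with made-up worker names and a hypothetical five-second segment length.

def assign_segments(workers, num_segments):
    """Round-robin assignment of consecutive audio segments to captionists.

    Segment i goes to worker i mod len(workers): each person types a few
    seconds, rests while the others cover the next segments, then comes
    back up in the rotation.
    """
    return {i: workers[i % len(workers)] for i in range(num_segments)}

# Four non-expert captionists covering ten 5-second segments of a lecture.
schedule = assign_segments(["ann", "bo", "cam", "dee"], num_segments=10)
for segment, worker in schedule.items():
    print(f"segment {segment} ({segment * 5}-{segment * 5 + 5}s): {worker}")

So I want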
to show you a little bit of how Scribe does that in this video. Specifically, I want to focus on the
fact that typing will lock in, so as people type you will see it just kind of go gray. They can't go
back and edit because this takes far too long and increases the latency by too much. It also
provides a lot of cues to type and some feedback in terms of bonus points and rewards to
encourage people when they are doing a good job and this can map to money in the case that
we're actually using Mechanical Turk workers or other paid crowds to contribute to this task.
You'll also notice that the volume will increase and decrease to help more with saliency and this idea of cueing people to type at a certain time. And that's not working. This is supposed
to have audio. I'm not sure why it doesn't.
>>: The television as he knows one way generally entertainment oriented and paid for through
advertising although increasingly by subscriptions now with cable television. And finally, the
current big thing, the internet which came to us starting in 1969. The internet itself has been…
>> Walter Lasecki: That was a little bit less content than I expected, but you get the idea.
We're prompting people. We have a bit of a salient cue in the volume as well as some direct visual cues. As you see, people still make mistakes. People still make mistakes because this is still a challenging task even for those few seconds. If somebody speaks too quickly we actually
go beyond someone's working memory. To make this easier I looked at what people who do
off-line captioning actually would do, and it turns out that they slow down the audio. Off-line
captioning is the same task as real-time captioning, but without the time constraint. You give
me the captions tomorrow and that's perfectly fine. While slowing down the audio works in
that case, it's not necessarily true that it can work in real time. We would obviously fall behind.
So we use the fact that we have multiple people, not just a single captionist, to slow down what one person is hearing to, say, half speed, just for the section that they are actually typing. If they are supposed to be typing a certain segment, we play it at half speed. If they are not, we take advantage of the fact that people can listen a lot faster than they can type, and we speed it up to one and a half times speed while other people are responsible for that content. This means that we can play the audio at half speed for everyone while they are supposed to be typing, so everyone hears their own segment slowed down and you still keep the surrounding context. You can't just remove the audio in between the segments; it turns out people are almost as bad as computers when you do that. By allowing people to hear a more approachable version of the audio, we can increase the recall rate and the precision pretty significantly on these tasks.
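A sketch of that per-worker playback schedule, with the half-speed and one-and-a-half-speed rates from above and hypothetical segment numbering:

def playback_rate(segment, assigned_segments, slow=0.5, fast=1.5):
    """Per-worker playback speed in a time-warped captioning stream.

    A worker hears their own segments at half speed so typing is tractable,
    and everyone else's segments sped up, since people can listen much faster
    than they can type, which keeps the stream roughly in real time overall.
    """
    return slow if segment in assigned_segments else fast

# Worker 'ann' owns segments 0, 4 and 8 out of a 12-segment stretch.
ann_segments = set(range(0, 12, 4))
print([playback_rate(s, ann_segments) for s in range(12)])
# [0.5, 1.5, 1.5, 1.5, 0.5, 1.5, 1.5, 1.5, 0.5, 1.5, 1.5, 1.5]

What's even more amazing is that we actually see a decrease in the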
latency, so we're slowing down the audio and yet we're getting faster responses. We went back and did interviews with a lot of the workers we were engaging for this task and looked at some of the data, and the high-level reason turns out to be that, basically, when the audio is too fast for somebody to keep up with, what they'll do is listen, memorize everything and then start typing after their segment has stopped playing, basically in the little context period where somebody else is typing. That means we pay a shift penalty where we don't have the person typing at all until after we have stopped playing their audio clip. It turns out that when we slow this down it's a more tractable task and
people can type each word as it comes in more often, so we get this decrease in latency. I want
to show you quickly what that looks like. It'll be another video. It's basically the same as before
but now with this playback speed adjustment.
>>: After that with the television network which arose during my lifetime and television as you
know is one way and entertainment oriented and paid for through although increasingly by
subscriptions now with cable television. And finally, the current big thing is the internet.
>> Walter Lasecki: The current big thing is the internet. People can adapt very quickly to this. They can follow along and can complete this captioning task more easily. But it's still a pretty challenging task. People can still fall behind. They might miss a word. They might not know a certain word, and it's pretty hard to catch up. And we can't scale the same approach that we have. We certainly have a lot of people available now, because anybody who can hear and type can help, but we can't scale this division and round-robin approach to too many people, because if you imagine handing a tenth of a second of audio to a large set of people, that tenth of a second is not making it any easier to caption the content. Instead, what we're going to do is use redundant input. We still give out a few-second segment, but now we assign a few workers to each segment. Collectively this allows people to provide these captions very accurately and with high recall. We can capture 95 percent of the words that are said with 87 percent precision, and we can do this with under 3 seconds of worker time per word of latency. This is actually kind of on par with a professional in precision, and better in recall and latency.
>>: What is the difference in cost?
>> Walter Lasecki: The cost does increase. Of course, we have a lot of flexibility in the price we are paying. We were already at about a quarter of the cost. We can do this with a combination of not just workers but also ASR, which I'll mention in a little bit, but it's still much cheaper than providing this using professionals, and it's much more available and higher accuracy because of the subject matter expertise, specifically. But of course the problem is that now we have multiple streams of captions, which, as we talked about before, does not work for many applications. And it turns out that if you bring students in to actually read captions, maybe 10 seconds of captions, it takes them 45 seconds if those are parallel caption streams. They can reconstruct what was being said, but it's so slow that we would lose our real-time constraint just on the reading alone. So what we want to do is add this combiner phase where we merge all of the words that people have typed back together and create a single caption that is easy to read. We are able to do that using multiple sequence
alignment, which is a process that was originally used in computational biology to align genome sequences. While this will figure out where we have gaps and where we are missing words, and maybe even allow us to align words that we want to compare and de-noise in case somebody has a typo, for example, unfortunately all of the existing work on this has been off-line. These are dynamic programming algorithms that can give us a very good alignment, but we have to have all of the input that we want to align first, which doesn't work for our
real-time case. Instead, I came up with a graph-based approach that constructs a graph as we see words, based on which words we see immediately next to each other in different workers' inputs, and weights the edges. We can then go back with a language model and re-weight the edges, so that when we find the highest-probability path through the graph this results in a pretty reliable caption.
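A toy version of that combiner is sketched below; it uses raw adjacency counts and a greedy walk instead of the weighted, language-model-informed search the real system uses, and the example captions are invented.

from collections import defaultdict

def combine_captions(partial_captions):
    """Merge several workers' partial captions into one stream (greedy sketch).

    Builds a graph whose edges count how often one word was typed immediately
    after another across workers, then follows the heaviest edges from a start
    token. A language model could re-weight these edges before the walk.
    """
    edges = defaultdict(lambda: defaultdict(int))
    for words in partial_captions:
        prev = "<s>"
        for word in words:
            edges[prev][word] += 1
            prev = word

    merged, node, seen = [], "<s>", set()
    while edges[node]:
        node = max(edges[node], key=edges[node].get)   # most frequent next word
        if node in seen:
            break                                      # stop if we start looping
        seen.add(node)
        merged.append(node)
    return " ".join(merged)

# Three noisy partial captions of the same phrase.
print(combine_captions([
    ["the", "current", "big"],
    ["the", "current", "big", "thing"],
    ["current", "big", "thing", "is"],
]))   # "the current big thing is"

This works pretty well and holds under much more general constraints. More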
recently we have been using an A* search based approach with a beam heuristic that lets us more accurately control the computational and time resources that we use for a given alignment, and integrate the language model in a more principled way. This gives us our single
caption. And we have actually been able to show that this is very useful. We have used this in
real domains, so we captioned a number of conferences including ASSETS last year, which you
see here. This is basically our screen on the left and we are captioning for the entire
conference and the speaker's off to the right somewhere there. Interestingly, instead of hiring
Mechanical Turk workers and instead of bringing people in explicitly to provide captions for this session, we were able to go into the audience, ask for volunteers and get five students who had never used the system before. This was about 5 minutes in advance of the session, and they all sat down and were able to produce pretty good captions even with only a few minutes' notice. This really points to the idea of democratizing access technology. Now anybody who's interested, friends, peers, family members, can help with this task, whereas before it really wasn't viable to do so.
>>: I've got a question about how that works. People were in the session, so did they actually wear headphones so that they could hear the slowed parts?
>> Walter Lasecki: Right. In this case we weren't using the time warp with the in-session participants, but you can imagine doing exactly that with headphones. It does make the task a little bit more challenging to go without it. We mostly did it that way because there is still server latency in this setup, basically. We have to stream the audio to our server, which processes it and sends it back, and we didn't want to add that extra second or so of latency.
>>: Presumably because I'm attending the talk it's something that I'm interested in, but did you get a sense from the people who were doing it of how much it detracted from their experience?
>> Walter Lasecki: Yeah, that's a very good question. It is often difficult to follow the content. We were only using five people here, so it's still a reasonably intensive task. One way that we've looked at this is that here you see a little bit of loss; if we were using fewer people you would feel it even more. What we found is that if we want to use volunteers in a classroom where we actually have students, it turns out that a vast majority of students would be willing to help a peer as long as it didn't hurt their understanding of the content. So what we can usually do is find 30 students in a classroom and make it so that you only have to type a few seconds every couple of minutes, which makes it very easy to follow along and doesn't detract at all. The general approaches that we use here can actually generalize to things like coding behavioral video in a much shorter amount of time than it would usually require if you were using an undergrad, so a couple of minutes instead of a couple of weeks in this case, and even to activity recognition settings where we actually need high-speed action labels to ensure
we understand what's going on in an environment in real time. The high-level takeaway is that
instead of selecting a single person or a single answer to include in our final output, we want to
synthesize our answer from a set of different people and this allows us to go beyond what one
person is able to do and actually stitch back together something that's higher-quality. Now I
want to talk a little bit about where I want to go with this work in the future. I really started
this talk by describing how micro-tasks have allowed us to use human computation in settings that computers can't operate in alone. We can get labels for images. We can answer a lot of simple questions. It works great for batch-processed tasks or anything where we only need a single response. In this talk I've shown how we can greatly expand the space of problems we can use human computation in by looking at continuous tasks and how we can support interaction over multiple turns with an end-user. But we are still looking at the systems that
we create, these intelligent systems that we create as tools. I really want to look at in the
future how we can use mixed initiative principles to go beyond this idea of asking a question
and getting a single answer, and actually work more deeply with the crowd and get the crowd's insight into problems as we work, even when we don't know that we had a question that needed answering. One space where I'm going to explore this is smart prototyping, basically taking an initial napkin sketch and seeing how fast we can turn it into a functional prototype. In work that's appearing at CHI this year I built a system called Apparition.
You see a little video here. And it's basically a platform for exploring intelligent prototyping
tools. The user can sketch a rough version of what they want while describing it out loud and
the system will automatically convert their sketches to real elements as they go. This is basically like describing it on a whiteboard and ending up with a more realistic interface. Here we are prototyping a simple platformer game, you can think of Mario, and the user is able to describe something, sketch something, and the crowd is working behind the scenes to make this into real content, updating the grass to be green in this case. After about a minute we have a sketched prototype, but we want to add functionality. So the user creates a character and describes a basic behavior, it should follow where they click, and within 3 seconds the character actually has that behavior. You see they click and the character follows where they said
to go. They can keep using this behavior and they don't have to keep re-specifying it every time
they interact with the system. The system remembers how it's supposed to work. You can see
here they play a simple game to get across to the other side.
>>: Is that because the crowd workers have programmed the behavior behind the scenes, or is it because crowd workers are doing something like Wizard of Oz in real time?
>> Walter Lasecki: Here we're looking at basically a collective Wizard of Oz process where
people are coordinating to control different pieces of the interface. In fact, we have a lot of
new interaction techniques that make this possible, where prior synchronous drawing systems didn't actually support truly synchronous editing; things like Google Draw, for example, don't really work in this setting. There are a lot of different things we do to coordinate workers, but it is more of a free-form task than is typical. And we were able to get some more flexible systems from it as well, so it's not just workers coordinating; the system tries to do some gesture recognition for the drawn elements and will post a to-do item here on the side when it doesn't know, so that workers can take that task and help complete it. This is really cool, but it is still using the idea of the system as a tool. I want to go beyond this and look
at how we can take some of these prototypes we've created and let the crowd help us create
improved interfaces. Maybe in this case that's importing more thematically appropriate
content, so we don't actually have to go back but the crowd can help us figure out what makes
the most sense in the setting we described. And this can actually be very useful for blind
developers who might actually need otherwise help from sighted peers to them up with their
finalized version of their interface even though they have an intuition of what they want. We
can always use the fact that the crowd themselves understand why certain things happen and
start to capture that idea more formally, so maybe things like collidable surfaces, so you have
the box in the ground or something that the character can't pass through. But actually the sun
is in the background and people understand that intuitively, but the system just sees polygons.
We can also capture the idea of playable characters and maybe some of the relationships and
actions that they can carry out and even what happens when those actions are taken and they
interact with one another. And we can use this idea of using the crowd as a formalization layer
to help the system better understand what's happening in the world and to actually start to
predict what might happen in the world. And if we keep a small frontier of likely states, we can start to borrow a page from predictive parallel computation and try to guess what will happen, precompute it using people in the crowd, so go ask what might happen if I reach this state. Then when the user actually gets to that state we can have a zero-latency response. The computer already knows; it just has to apply what is about to happen.
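A sketch of that speculative precomputation loop, with stubbed-out predictors and crowd calls standing in for the real components:

def precompute_frontier(current_state, predict_next_states, ask_crowd, cache):
    """Speculatively ask the crowd about states the user might reach next.

    predict_next_states(state) guesses a small frontier of likely next states;
    ask_crowd(state) is the slow, seconds-long human computation call. If the
    user then reaches one of those states, the answer is already cached and can
    be returned with near-zero latency.
    """
    for state in predict_next_states(current_state):
        if state not in cache:
            cache[state] = ask_crowd(state)   # done ahead of time, off the critical path
    return cache

def respond(state, cache, ask_crowd):
    # Zero latency if we guessed right; fall back to a normal crowd query otherwise.
    return cache[state] if state in cache else ask_crowd(state)

cache = {}
precompute_frontier("door_closed", lambda s: ["door_open", "door_locked"],
                    lambda s: f"crowd answer for {s}", cache)
print(respond("door_open", cache, lambda s: f"crowd answer for {s}"))

If we can start looking at zero-latency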
crowds, then in the same way that going from hours to minutes and then minutes to seconds opened up whole new spaces of applications, I think zero-latency or few-millisecond responses will do exactly the same. And then, kind of behind the scenes, how can we use the understanding that both the system and the crowd have about the application domain to template out code or generate code, either from scratch or from demonstrations, or even help workers and users import generalized examples from off-line crowdsourcing platforms and community exchanges into the specific cases that we want to work, automatically? I think these
mixed initiative systems will help us go beyond just a naïve mixing of these three contributors
and actually look at how we can best take advantage of the strengths of the crowd, the creativity of the crowd and of the end user, and some of what AI can do to understand and reason about the domain. In this talk I talked about how we can support ongoing interactions with the crowd, and I really focused on the fact that consistency is the key to success. In all of the systems that I showed, we need to have consistent output that doesn't conflict with itself and that respects the progression that got us to that point in the interaction. To design systems that do this I've introduced the idea of a real-time crowd acting as a crowd agent, and an architecture for how we can create these systems. And I'm really excited about how we can support richer and richer interactions with
the crowd via things like mixed initiative principles. Thanks. [applause].
>> Kori Inkpen Quinn: Any questions?
>>: I have kind of a question and kind of a comment. I saw your crowd paths when you were driving your robot around, and in many ways it felt like [indiscernible] Pokémon. Did you find that there was a way to kind of exclude trolls from your system automatically?
>> Walter Lasecki: Yeah. One thing that drove me a little nuts about the [indiscernible]
Pokémon, while it's an awesome example, basically this is a game that was set up I guess last
year that allowed a lot of people to play a Game Boy version of Pokémon, but it was 10,000
people all playing at once and just kind of throwing input at the system. I think it had one or two modes. It was kind of mob control and a vote, although it wasn't clear exactly how they were voting. I think they were also binning by time. These are basically two of our baselines. One thing I didn't show is that mob control was also something we tried that doesn't really work, because you again lose consistency over time. What we see is that when we are actually learning the weights and using the crowd's collective voice to select workers, we are able to very easily identify people who are trying to differ from the rest of the set. We can't guarantee that those people aren't malicious, but since things like Pokémon or this navigation task are relatively straightforward in terms of making progress, people who are not contributing in that direction are pretty easy to detect. So yeah, we can do that. Yeah?
>>: Can you define again what it was you meant by [indiscernible]? I wasn't totally clear on why the earlier systems you talked about didn't count as mixed initiative. It seemed like the one where the blind people took video while the chorus of workers were dialing in was similar to the [indiscernible] example.
>> Walter Lasecki: Yeah. That is actually kind of a nice example of a mixed initiative
interaction, but it's not something we necessarily designed for. It was kind of a byproduct of
the way we designed the incentive mechanism and because we had this rolling process where
we didn't just restrict people to kind of a single contribution, we were able to get this insight. It
was the same thing with the Chorus example where we come back and actually suggest the city
pass even when the user didn't ask for it. So I really think it's about focusing on trying to design for this, rather than just taking advantage of what naturally occurs. In those systems it wasn't the initial intention.
>>: And do you think in systems like that is it important that the end user believes that they are
actually interacting with a single actor or with an intelligent computer system versus that they
know that they are interacting with a crowd of workers? Is it important to hide that fact from
them or is it not?
>> Walter Lasecki: We were mostly focusing on hiding this fact in the context of trying to make
sure that we didn't have multiple threads of interaction. It's confusing to actually hold multiple
conversations at the same time, so from that standpoint we don't want to feel like we're talking
to a set of people. But at the same time once we have some of this filtering, it turned out that
it was actually useful to end-users to include some information about how confident the system is. So we can tell people, yes, there are multiple people behind the scenes and this is the level of agreement on a given answer. And people find that informative, so it's a combination. Yes?
>>: So crowdsourcing research has exploded in the past several years. What do you see going forward? Do you see the types of fads and activities that people are applying to crowdsourcing staying somewhat consistent, or do you anticipate that work will [indiscernible] or deviate off in a different direction?
>> Walter Lasecki: I think we've only seen a fraction of the actual applications that we could
create. Even this idea of moving towards more mixed initiative systems and collaborating with
our intelligent systems is something that really hasn't been done yet even though we keep
claiming that we have an intelligent system. It is very much designed from the tool perspective,
so I think there will be a lot of brand new applications, not just kind of solidification of prior
work. I also think that hopefully the platforms will become better and make it a little bit easier
to work in this space.
>>: What do the platforms need in order to become better?
>> Walter Lasecki: Right now, for example, Mechanical Turk really doesn't have real-time support. We're kind of hacking the system a little bit to pre-recruit our own crowd that has these properties, which is very hard to scale and also hard to do as a one-off; there's a middle level where we need some workers but don't need too many. I think if the platform natively supported this kind of task routing to individuals who are available at a given time, or at least allowed people to be kind of on call, we would actually see more people working in this space. Yeah?
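For illustration, here is a minimal sketch of the retainer-style, on-call support described above, assuming a hypothetical pool object rather than any real Mechanical Turk API; pre-recruited workers wait in a queue and a task is routed to whoever is available within a couple of seconds, falling back to a normal posting otherwise:

```python
# Minimal sketch of a retainer pool: workers are pre-recruited and held on
# call so tasks can be routed to them in seconds rather than minutes.
import queue
import time

class RetainerPool:
    def __init__(self, target_size):
        self.target_size = target_size     # how many workers to keep on call
        self.idle_workers = queue.Queue()  # workers currently waiting

    def worker_arrived(self, worker_id):
        """Called when a pre-recruited worker checks in and waits on call."""
        self.idle_workers.put(worker_id)

    def route_task(self, task, timeout=2.0):
        """Send a task to the next available on-call worker, if any."""
        try:
            worker_id = self.idle_workers.get(timeout=timeout)
        except queue.Empty:
            return None  # no one on call: fall back to posting a normal task
        return dispatch(task, worker_id)

def dispatch(task, worker_id):
    # Placeholder: a real system would push the task to the worker's open
    # browser session (e.g. over a long-poll or websocket connection).
    return {"task": task, "worker": worker_id, "dispatched_at": time.time()}
```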
>>: A lot of this work essentially embeds the crowd in what are, from the user's perspective, pretty intimate settings. For blind people it might be their bedrooms, or in the transcription case the content might be something that you don't necessarily want to spread to the entire crowd. Have you thought about privacy?
>> Walter Lasecki: Yeah. We've actually done some of our first work in this space, with Jamie and AJ, where we were looking at how you might attack these systems, how malicious workers in these systems could actually be a threat to the end user. And while we're not seeing a high level of malicious users right now, people who would actually do bad things with the information, certainly that will grow over time. It's also true that a single individual is not necessarily the biggest threat, because a lot of the de-noising properties can at least prevent bad answers. So in the blind user case, where we actually want to make sure that we don't misinform the end user, bad input often gets filtered out with the same approaches we are using here. If we want to prevent workers from seeing the content in the first place, we are now looking at how we can create intelligent filters that use people but never reveal that information, which I can talk a little bit more about offline if you're curious.
>>: You mentioned zero latency crowd. How is that possible?
>>: Don't take my question. [laughter].
>> Walter Lasecki: The idea here would be that as we start to formalize more and more of the information about the domain, which in this case we just created, the system can start to run a planning process over what might happen. Where could we go from the point we've reached? Where do we expect to go based on prior interactions? And once we can start to narrow down the size of that guess set, we can start to precompute, maybe three seconds ahead if that's our current latency, what might happen, get that answer, and then have it ready to fire immediately when we reach that state. So we are talking about a few milliseconds, whatever it takes a computer to respond, rather than whatever it would take to be surprised by that new state and then go out and get an answer for that exact moment.
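As a rough sketch of this speculation loop (not code from the talk; predict_next_states and ask_crowd are hypothetical stand-ins for the planner and the crowd query), the system fires crowd requests for the predicted guess set ahead of time and serves the cached answer the moment a predicted state actually occurs:

```python
# Minimal sketch of speculative precomputation: ask the crowd about likely
# next states before they happen, then answer from the cache when they do.
from concurrent.futures import ThreadPoolExecutor
import time

def predict_next_states(current_state, history):
    # Hypothetical planner: returns a small guess set of likely next states.
    return []

def ask_crowd(state):
    # Hypothetical crowd query; stands in for a ~3 second round trip.
    time.sleep(3.0)
    return f"crowd answer for {state}"

pool = ThreadPoolExecutor(max_workers=8)
pending = {}  # predicted state -> future holding the crowd's answer

def speculate(current_state, history):
    """Fire crowd queries for the guess set now, before the state is reached."""
    for state in predict_next_states(current_state, history):
        if state not in pending:
            pending[state] = pool.submit(ask_crowd, state)

def respond(new_state):
    """Serve the precomputed answer almost instantly when the guess was right."""
    future = pending.pop(new_state, None)
    if future is not None:
        return future.result()   # milliseconds if the crowd already answered
    return ask_crowd(new_state)  # surprised: pay the full crowd latency
```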
>>: There was another lady wondering at some point about where these things are going. One of the things that you talked about a lot [indiscernible] systems is latency and trying to get that down towards zero. What I'm also seeing in a number of the different things you have done is that each still feels like a specialized solution: deal with this problem, this particular way of working, find the right information. Since each of these problems is different, do you envision that we will have more general solutions for reducing latency, or is it going towards just building a tool set of like 58 different kinds of things? Are there generalizable solutions? Or is it kind of a new problem [indiscernible] exploration phase each time? What's your sense?
>> Walter Lasecki: Yeah. What I've focused on here, in the broadest sense, is the type of input and output, going back to this crowd agent architecture: what types of input do we have to deal with, a stream that's faster than individual people can contribute to, that kind of thing. So I think there are those broad classes, and if we see the same problem, or the same kind of properties in a problem, we can reuse the approach. There are also going to be more fine grained optimizations, like time warp, where we look at how people complete a specific task and at some of the human factors of doing that task, and then augment the general approach with that. Scribe is a good example in that the reasonably general approach carries over: we've done very different things, like activity recognition, using a very similar process, but there we didn't use things like time warp.
>>: We are going to take it in a different direction. Mechanical Turk is about paid crowd workers, so it's about getting a crowd to work and help me complete a task. The robot driving is a little different, the same as the Pokémon game: that's more crowd participation, so it's not that I'm doing a task for you, it's that I want to play this game with a crowd. Has there been much of that type of activity, where it's not that I'm working for somebody, but that I'm doing this for my own enjoyment with other people?
>> Walter Lasecki: Not so much within the computation space. You do see other things in crowd sourcing that maybe are; you can even think of things like crowd sourcing for community events as a type of this, where we are all working towards a common goal, but there's not necessarily a computational process involved. I think it is interesting that the joint game playing we set up in Legion is self-directed; that's not necessarily true of the robot driving. There we were actually specifying the goal, so we were looking for a particular process. But the gaming is a very good example using the same system: there we were not hiring people but letting people, as you saw in the picture, sit on a couch and collaboratively play the game. I think we can transplant some of those ideas from the computational space to the more general self-directed crowd sourcing. I also think this is different from Apparition, where we are trying to use a computational process one level of abstraction up: we know people are completing something, computing, using [indiscernible] just with their mental process, to get to the response that we need, and that's integrated in a certain way. The self-directed case is much more undirected, and I think what we can learn from teamwork settings informs how we do the coordination in things like Apparition, where workers don't have an explicit "I know I need to do x"; it's kind of up to the workers to figure out what they need to do.
>> Kori Inkpen Quinn: Any last comments?
>>: That just made me think: do you have thoughts from the workers' perspective as to what it's like to do these small tasks or participate in these things? Because when you think about playing the game, that is sort of us doing it rather than it being an external worker.
>> Walter Lasecki: It varies a lot by type of task. What we see in a lot of the assistive technology work is that people are actually more engaged than with micro-tasks, and they feel like they have contributed a bigger piece when we use continuous tasks. But in general, knowing that they are doing good is something that helps. I also think that from the worker side, this line of work keeps their interaction with the system continuous, and that avoids a lot of the interruption problems that we see in traditional micro-task crowd sourcing. There is actually a paper that looks at some of the detrimental effects on workers of having their workflows constantly interrupted in ways that are pretty common when we use micro-tasks, and the punch line is that it can take people up to twice as long, which of course they are not paid differently for, so just a little bit of wrong routing can essentially cut their effective pay rate in half. So this is trying to get away from some of those problems.
>> Kori Inkpen Quinn: Okay. Thank you. [applause].