
>> Jaime Teevan: All right, so welcome. Thanks for being here, and thanks for tuning in, also.
It's my pleasure to introduce Jeff Rzeszotarski. Jeff is currently finishing up his PhD at CMU,
where he's advised by Niki Kittur. He is an HCI person who uses InfoVis solutions to solve
problems in social computing and crowdsourcing, and Jeff has built some really interesting
systems, including what is now known as DataSquid and published a lot of great work at CHI,
CSCW, UIST, HComp and received a handful of Best Papers, as well, from CHI and UIST. Jeff
is familiar around the lab, as he did an internship with Mary Morris a few years ago, and he's
also an MSR Fellowship winner, and I know I and others here have cited his work in our papers.
More recently, he also received the Carnegie Mellon Student Innovation Fellowship, so I'm
looking forward to his talk today on what he learned about helping people make sense of data.
And I also thought I'd mention for those of you online, I have the question-asking terminal, so
feel free to ask questions, and I can say them out loud.
>> Jeff Rzeszotarski: Awesome. Okay, hi, everyone. Jeff Rzeszotarski. So I'm just finishing
my PhD now at Carnegie Mellon, and like Jaime said, I kind of work at the intersection of data
visualization, social computing and crowdsourcing, and it makes sense to start my talk, I think,
with data. So here are some interesting data sources that are pretty darn huge. So Wikipedia just
recently passed 5 million articles in size. Mechanical Turk as I sampled it a couple of weeks ago
has over 200,000 different tasks for workers to do, presumably organized by many people. And
even on a consumer site like Zillow, we've got over 2.5 million data points, or in this case,
houses you could purchase. And the thing about these massive data stores is, we're actually
using them in our everyday lives in a really meaningful way. So Wikipedia, if you're at all like
me, you incorporate that in your everyday knowledge. Rarely a day goes by that I don't use
Wikipedia. Mechanical Turk is used in products that face us, whether it's quality assurance,
search, translation approval, corpus building, all sorts of things in the back end. And even a site
like Zillow, it's ending up being used for one of the most important decisions in a person's life,
buying a house. This is serious economic impact. The problem is, these aren't necessarily
perfect. Wikipedia has an overwhelming amount of historical data. If you think of a Wikipedia
article as sort of like an iceberg, you can see the article on top, but there's gigabytes of historical
data underneath it, collaboration by thousands of people that's hidden from you. Mechanical
Turk, if you're organizing one of those 200,000 different tasks, you get a stream of raw results
back, and it's not necessarily obvious which ones are good, which ones are bad, how they all
contextualize together. And even a consumer site like Zillow, it's rare that a person making a
decision exploring data has a perfect criterion or makes the perfect optimal choice. It's always an
involved, exploratory process among a lot of data. So the common thread among all these
different domains is that there's really no one-size-fits-all solution here. We need people to
understand the data, understand context, be able to make an informed evaluation of history or
which purchase or which house to buy or which Mechanical Turk task is the good one. But
there's no real easy way to do this. And so what I point to in my line of work is the idea that we
need to support data exploration. We need to help people not only find the thing that matches
their expectations but in fact understand what their expectations are in the first place. And so my
core guiding principle in my work is really how do we design systems that help users see and use
context in complex data? And this idea of context is really important. If you think about when
you're making a decision or trying to make sense of data, you can't just come into the data source
and know immediately what it is you need to see and why. You need to see a bunch of
examples, a bunch of counterexamples, to build up a model, to build up an understanding. And
so in my work, I focus on a number of different domains, identifying how we can surface context
to help people complete complex tasks and make sense of data. This might mean that people can
perform more quickly or more effectively. They may be able to explore more features in the data
at once, or their findings might be better. They may have more satisfaction after exploring this
data. At this time, I want to draw a distinction here between directed and exploratory tasks. So
what I call directed data tasks are ones where you kind of already know what you're looking for.
If you think about Google Search, you have some search terms. It is well able to fulfill those
requirements. If you have discrete criteria, systems right now are very effective at giving you
exactly what it is you're looking for, whereas exploratory tasks are much harder to afford. If you
think about trying to explore, it involves building a mental model. You don't come in with a
perfect representation. You build that representation from the ground up. And only then once
you've built a model can you generate insights, which are different from a perfect decision
matching criteria, and your decisions themselves are kind of integrative processes. It's not this
matches, I'm done. To link this a little bit to literature, we can consider what cognitive science
has done with sensemaking. So sensemaking is the process of constructing meaning from
information encoded in data. And Weick I think describes this really nicely as an iterative
process. You're developing a mental model, but you don't come into the data with it
immediately. You iteratively build it up over time and over exploring. Pirolli and Card have a
really evocative term, which is foraging. You're searching around for examples, you're
searching for necessary data to build up an understanding. And Perer and Shneiderman add a
really nice complication to this, which is all of us in this room are familiar with using statistical
tools, kind of specific analytics, but what they found is that when you're making sense of data,
very often, you have to start with a broad exploration before you can even apply those statistical
tools. So you can imagine a narrowing, where you start with an exploration, where you need to
make sense of data iteratively, and only after a while can you actually apply statistical
techniques. To illustrate this even further, let's look at a sensemaking model, so this is Russell et
al.'s model, which I think is pretty effective in conveying what sensemaking's iteration is all
about. Imagine you have some data and you come into this task with a little bit of understanding
about the data. I maybe have a task that's telling me what to look for already. In the green box,
you start searching the data for good representations. In other words, you're trying to find data
that match what it is you think the data is all about. So to give you an example, if you're looking
for a house to buy, you may already have some existing criteria, and you can find houses that
match that ideal, as you're trying to make sense of what's out there. In the blue box, you take
those examples, those representations, and you encode them into your mental model, so in this
box, what you're really doing is taking those examples and asking yourself, do they fit in my
model? How do they fit? Do these match my understanding well, and if so, that's great. But if
not, you end up with residue, that red arrow. So as you explore data, you're not going to find
every single example fits your mental model. In fact, some things don't fit your mental model,
and as you explore more and more data, you build up this residue, stuff that doesn't fit.
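To make the loop concrete, here is a purely illustrative Python sketch of that search, encode, residue cycle. Every name and the crisis threshold are invented placeholders for illustration, not part of Russell et al.'s actual formulation.

```python
# Illustrative single-pass sketch of the sensemaking loop described above.
# find_representations, fits, adjust_model, and CRISIS_THRESHOLD are all
# hypothetical placeholders, not a published algorithm.

CRISIS_THRESHOLD = 10  # how much unexplained "residue" we tolerate before revising

def sensemaking_loop(data, model, find_representations, fits, adjust_model):
    residue = []
    for item in find_representations(data, model):  # green box: search for representations
        if fits(item, model):                        # blue box: encode into the mental model
            model.encode(item)
        else:
            residue.append(item)                     # red arrow: residue accumulates
        if len(residue) >= CRISIS_THRESHOLD:         # crisis point: adjust the model
            model = adjust_model(model, residue)
            residue.clear()
    return model
```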
Eventually, you hit a crisis point where you have to adjust. You have to change your thinking
about the data so that you can accommodate this residue. And so if you think about this process
more holistically, we have iteration where you're trying to find examples. You try to find
counterexamples that are signals that you need to switch your understanding, and over time, you
accumulate less residue, because your model is getting better and better. This points to a couple
of different ways that we can improve sensemaking for data explorers. In particular, we could
help them find those representations better, direct them to ones that either confirm or disprove
their hypothesis fairly quickly. Similarly, we could also improve the iteration of the process, so
how much data can you cover in a given loop of this process? This can help people make better
explorations and develop better mental models, but it's even more complicated than that. If you
look at decisionmaking with data, people rarely get the perfect optimal choice. In fact, most
everyday decisions are made without a full examination of all the available options. The best
option may be missed. In Carnegie Mellon land, we call this satisficing, the idea that you
make the best decision you can within the constraints you have. Maybe you just don't have time to
find the perfect one. Complicating this even further is the idea that we have physiological
limitations on our data sensemaking process. So working memory is only seven-ish units in size.
You cannot store a huge amount of data in your head as you're performing a task. We have
limited attentional resources, so you can't focus on too many different targets as you're exploring
data without getting overwhelmed and your performance degrading. And even things like
feelings of self-efficacy, expecting that you're going to do well in a data sensemaking task,
actually impacts your performance. If you feel like you're going to do a good job exploring data,
you actually do. So to operationalize this a little bit, let's look at buying a house in Pittsburgh. If
you're going to be buying a home in Pittsburgh, which is a data analysis task, and these numbers
actually are reasonable for Pittsburgh, for those who are not necessarily believing me -- it's
amazing. So you often come into a data task like this with some existing expectations or
understanding. In this case, I have bedrooms, baths and budgets, but as you look at some more
data, you realize there are some criteria you didn't expect to see but really do care about. So in
this case, maybe you realize parking is an issue in Pittsburgh, and you really care about a nice
neighborhood. However, as you keep exploring, you find, you know what, a nice neighborhood
is actually pretty expensive. I want to live in a nice place, but it's going to cost me, so I have to
re-adjust some of my existing criteria to match. And as you keep exploring, you end up
accumulating a lot of different criteria, which speak to a really deep understanding of the data.
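As a rough illustration of how that accumulation of criteria narrows the field, here is a minimal pandas sketch; the listings.csv file and every column name are hypothetical, not real Zillow data.

```python
import pandas as pd

# Hypothetical Pittsburgh listings file and column names, purely for illustration.
houses = pd.read_csv("listings.csv")

criteria = {
    "beds": lambda df: df["bedrooms"] >= 3,
    "baths": lambda df: df["bathrooms"] >= 2,
    "budget": lambda df: df["price"] <= 300_000,
    "parking": lambda df: df["has_parking"],
    "neighborhood": lambda df: df["neighborhood_score"] >= 8,
}

# Apply the criteria one at a time, the way an explorer accumulates them.
remaining = houses
for name, rule in criteria.items():
    remaining = remaining[rule(remaining)]
    print(f"after '{name}': {len(remaining)} houses left")
# Each added criterion shrinks the set; by the end there may be nothing left
# that matches everything, which is exactly the tension described here.
```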
But as you can imagine, trying to find something that matches all of these is a really hard
process. So this idea is really characterized by exploration. You don't just find the point
you're looking for. You build up a model as you explore by seeing a bunch of contextual
examples. There's this idea of hypothesis testing. You think to yourself, all right, well what
about a nice neighborhood? What would that look like? Well, let me experiment and see what
that might be. It's an active, iterative process, as I mentioned earlier. So switching gears, let's
look at how it goes with an existing interface that consumers may actually use. So this is
Zillow's visual tool for identifying houses. You can see each red dot is a house in Pittsburgh.
And I can actually pretty readily, using their faceted browsing tools, pick out some criteria, and I
can encode these directed criteria really easily. The houses will disappear. The ones that still
match my criteria will stay there, but if I want to ask some questions of the data, like what if I
want more bathrooms, what if I want a bigger place or a smaller place, I have to go through a lot
of different interface steps and then see what's appearing and disappearing in order to gain any
understanding. This process has a disconnect with what people are actually doing. For known
goals, that interface is really good, but for exploration, it gives you really hard feedback. Either
points are there or they're not. You don't know why they disappear or why they appear, based on
your filters. Interactions to explore and test different values involve a lot of different steps. And
so what I point to in my work are ways that we can surface context in a really fluid, natural way
that's relevant to the user so that we can accommodate exploration and deeper decisionmaking,
and I'm going to do that in this talk in three different domains, where I focus my work. So to
begin, let's look at Wikipedia. I mentioned earlier that Wikipedia pages are sort of like an
iceberg, and here's an example of one page. I've actually contributed to this, though you may not
be able to tell just from looking at it. And if I asked you, who's contributed to this? Are there
any viewpoints that are particularly strong on this, what are the cultural background or the
gender of the people who've contributed to this page, were there any debates going into it? You
wouldn't be able to tell, just by looking at its current state. Wikipedia is an immensely
collaborative artifact, but the collaboration is hidden from most everyday users. You might want to get
at some of this collaboration because maybe you're going to make a contribution, and to make a
successful contribution, you need to know what's already been tried and what issues may be present
that you can't necessarily see immediately. So one thing that we can do is go to the little button
in the upper-left corner, Talk, which is discussion among editors, and for a big article, this
actually poses a serious barrier. So here's a discussion among editors for Scientology, which is a
controversial article. This here is several dozen pages long, and you'll notice at the top archives.
There are 30 different archives, each potentially dozens of pages long, all containing discussion
among editors. So if I asked you, what are people talking about in Scientology? What should we
do or not do based on discussions in the past, you wouldn't really be able to say that much.
Maybe you could use the search box to search for the word "cult," because you think maybe
there's a debate about that in the past, but it's not necessarily clear what you should glean from
this data store. Instead, maybe let's look at the past revisions, the past things that have changed
in the page in actuality. However, for an article like the article Abortion, just the diffs of
changes authors have made over time is roughly 20 copies' worth of Pride and Prejudice in
length. So we can't exactly expect you to dig into this content, either. So you can see the sort of
difference between exploration and directed tasks. You could search the discussion pages, you could
search the revisions for a specific term, but if you wanted to gain a general understanding, you
really have no ability to parse through this data. It gets even worse. So we conducted interviews
with three expert Wikipedians in the Pittsburgh area, and one of the things they pointed to
immediately was this idea of conflict. People are fighting on Wikipedia. They have zealotry
about certain topics, and for newcomers especially, this could become a serious issue. If you
wade into a conflict zone unexpectedly, your work is going to be thrown away, perhaps in a
hostile manner, and you'll never come back. One of our experts, in fact, has received sort of a
Wikipedia version of a no-contact order, because after a battle in one particular Wikipedia
article, they were stalked by another editor. The more interesting part, coming out of these
interviews, is about information overload, so we asked all three of these Wikipedians, what do
you need to do to make a successful contribution? And they all said to us, well, once you have a
region of the page you're interested in, you want to check the discussion and check past edits to
see what's happened, see what sorts of discussions you're having there, who the stakeholders are,
if there's any conflict. We then asked them, go ahead and make an edit for us, and we gave them
a couple of editing tasks. They did not use any of those resources. Immediately after telling us
history is important, they did not use historical resources at all. We asked them why was there
this sort of disconnect? Why didn't they use them? And they said, there's just too much. I'm
never going to be able to find out anything in a tractable amount of time, so I'm just going to try
out and see and hope for the best. So we have an opportunity to do better here, but if you look at
existing interfaces, we're still operating kind of at a high level, rather than digging into the
actual substance of the conversations and activity, so Wiki Dashboard in the upper left shows
you kind of temporal relationships of different authors, who's contributing right now and how
often. History Flow right here shows you the evolution of the page in graphical form, but it's
hard to get down to the level of what changes people are actually making and why, and in the lower-
left corner, you see Snuggle here, which is a tool for administrators to socialize and interact with
new editors, situated within the contributions these new editors have recently made. But Snuggle
actually was appropriated as a tool to target new users for being kind of too engaged and kind of
too inexperienced, so a lot of the administrators who saw these new edits in fact were throwing
them away, saying, nope, this isn't ready yet. You need to do more. Because they didn't
understand necessarily what the edits were actually doing in context. So with this problem kind
of in mind, we took all the discussions for a particular Wikipedia article and used topic modeling
to situate them within a small section of each Wikipedia article. So the idea is, as you're
browsing Wikipedia now with this Discussion Lens tool, within each section, you can start to see
important discussions for that section that led to its evolution. So for example, if you're on a
particularly contentious article, which is the article on hummus, the food -- it turns out it's a very
conflicted article on Wikipedia. When you're browsing the section on etymology, it's important
to know about this discussion right here. These are people debating whether the Oxford English
Dictionary is an authoritative source for the origin of the word hummus. Is it a Turkish word? Is
it an Arabic word? It turns out that if you assert that it's a Turkish word, these authors are
going to strike you down, and we see that throughout the history of this article. So now, if you
are reading the article on hummus, you can imagine seeing that as one of the recommended
related discussions. You'll be aware that this is an issue you may not want to wade into. So
we're cutting down the complexity here by surfacing relevant contextual discussions within a
much smaller section of the page. And in practice, this does make a difference, so we asked
participants in a between-subjects lab study, using either the Wikipedia interface or our interface, in
blue, to write a guide to an article section. And the idea was, if you were talking to your friend
and telling them what they should contribute to this new section, what are some openings, what
are the stakeholders? Are there some issues you should avoid or consider? For a small article,
like the article on hummus, our tool really isn't reducing the complexity, any. People are just as
able to write decent-quality guides based on history, but for a large article, like the article on
Alan Turing, the tool, by crunching down and providing just contextually relevant information in
a given section, actually does provide users a much better picture of historical data. This is on
the reader's side. Right now, for CSCW, I'm working on adding to this what the editor side
looks like. So based on this feedback, do you actually make a better contribution to Wikipedia,
knowing this extra information? This is the question we're looking at right now. So I talked
about discussion, but what can we learn from past contributions themselves? So I've also done
some work modeling past contributions to Wikipedia, and a lot of times if you look at it, this is a
stream of different contributions to the article on Scientology, you'll see comments like this: "undid revision by person." The idea is a lot of times in Wikipedia, work is just thrown away
wholesale, either because it violates norms as part of a conflict or isn't wanted by a particular
community surrounding an article. And so we constructed n-gram models to try and get an
understanding of what sorts of content are valued or not valued by editors, using machine
learning to cut through the complexity of this large historical store. So for a simple change like
this, changing jumps over to walks near, we can construct a feature vector that captures the
changes they made and considers whether or not that edit was accepted by the community or
rejected -- in other words, reverted in Wikipedia parlance. And if we do this over 150 different
articles, it turns out that we can actually pretty accurately predict whether or not contributions are
likely to be thrown away by the community, just by the words that they're choosing to change.
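A minimal sketch in the spirit of that revert-prediction model, assuming you have already extracted (old_text, new_text, was_reverted) triples from page histories. It uses off-the-shelf scikit-learn pieces and a crude word-set diff; it is an illustration, not the exact pipeline from the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def changed_tokens(old_text, new_text):
    """Crude word-level diff: tokens added or removed by the edit."""
    old, new = set(old_text.lower().split()), set(new_text.lower().split())
    added = ["added_" + w for w in new - old]
    removed = ["removed_" + w for w in old - new]
    return " ".join(added + removed)

def train_revert_model(edits):
    """edits: list of (old_text, new_text, was_reverted) tuples."""
    docs = [changed_tokens(old, new) for old, new, _ in edits]
    labels = [int(reverted) for _, _, reverted in edits]
    vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None)
    X = vectorizer.fit_transform(docs)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    # Pairing clf.coef_ with vectorizer.get_feature_names_out() gives per-term
    # weights, analogous to the "risky word" list shown for genetic engineering.
    return vectorizer, clf
```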
So the idea here is not that we should just tell a person, nope, yours is going to be thrown out,
yours is. Rather, that we can use this model to gain an understanding of what things are
particularly risky to do to a Wikipedia article. So this is model weights for the article on genetic
engineering, and so you can see at the top, dude is definitely something that you should not
contribute to the Wikipedia article on genetic engineering. Surprise. But maybe more surprising
is shouldn't. It turns out shouldn't is a prescriptive term, and Wikipedia's neutral tone does not
allow that sort of language by policy. The header you see there, genomique engineer, there was
a debate in the article about whether this header should be included, and so our model picked up
on the fact that that was a conflicted area of contribution, whereas of course Monsanto is a less risky term. Interestingly, exceedingly and involves, depending upon context, could either violate
Wikipedia policy or not. So the idea here is that this model is capturing some really interesting
features within the textual data store. And so the future I think is a really interesting possibility.
In the first line of work, I was looking at how to collapse discussions down for editors while
they edited in a particular article context, and you can imagine making this into a sort of
recommender system. If we know you're editing a particular section, we also know the changes
you're making, how do we present relevant data to you that's actually going to change the kind of
contribution you make? How can we be prescriptive and say to people, we noticed you're adding
Turkish to this particular section. That's no good. Here are some examples for why, and here's
some discussion that's relevant -- to actually improve their quality of work. Similarly, we can
use history to direct people to new interesting areas to contribute. If we notice that we don't have
many recommendations for a certain section, there isn't much activity, maybe that's an easy entry
point for legitimate peripheral participation as newcomers socialize. And perhaps most
interesting to me right now is being able to construct FAQs and guides dynamically based on
historical data. So if we know what's going on in an article section's history, can we construct a
guide for that section just from the kinds of comments people are making as they change work in
that section and the kinds of discussions that are happening? Go ahead, Mary.
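For reference, here is a rough sketch of the kind of section-to-discussion matching the Discussion Lens idea relies on, using off-the-shelf topic modeling. The LDA-plus-cosine-similarity pipeline and all parameters here are assumptions for illustration, not necessarily the system's actual implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def match_discussions_to_sections(section_texts, discussion_texts, n_topics=20):
    # Fit one topic model over article sections and discussion threads together.
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(section_texts + discussion_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topics = lda.fit_transform(counts)              # rows: sections first, then discussions
    section_topics = topics[: len(section_texts)]
    discussion_topics = topics[len(section_texts):]
    sims = cosine_similarity(discussion_topics, section_topics)
    # For each discussion, the index of the section it seems most relevant to.
    return np.argmax(sims, axis=1)
```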
>>: So I'm just wondering, so I like these ideas, but I also see how, particularly let's say with
Wikipedia, where there is a certain pervasive culture among the editors, which is very
exclusionary and presents certain points of view, and these tools and ideas you have might help
someone to fit into that culture and better operate within it, but doesn't maybe address the larger
question of how to effect change beyond that. Do you have any thoughts on that?
>> Jeff Rzeszotarski: And so I think this is actually really where I think we can effect change.
So this first one does run the risk of only enhancing the orthodoxy of Wikipedia, because we're
telling people, avoid this, it's dangerous. It doesn't change the dynamic at all. This work draws
on work on Wikipedia socialization, which generally says that newcomers tend to go to
Wikipedia articles, make one contribution, get it thrown away and never come back. The barrier
for Wikipedians is in fact these early edits. And once they've socialized a bit more, they can
handle wading into riskier areas and taking stronger viewpoints. And so one core possibility for
investigating historical data is to provide better entry points. So these may be lower-risk entry
points, in terms of an article section that hasn't been contributed to that much, but this gives
people hooks to begin effecting change, and can help bring in more diverse audiences. In
particular right now, Wikipedia is suffering from a gender and culture problem. It's
predominantly white males who are contributing. And if we can provide better entry points that
are less hostile for a variety of different contributors, maybe we can start to change the culture in
that manner. Yes?
>>: I was going to ask, it's sort of a related question. So yes, the statistics are pulling out a space
that reflects the culture or the people that contributed.
>> Jeff Rzeszotarski: And it's all temporal, of course, too.
>>: Yes, so my question is sort of specifically about the system that did the topic modeling. Is
there a way to weight the importance of the topics? Do you bake that in?
>> Jeff Rzeszotarski: Yes, I've got it here. So there's a star in the upper-right corner or
something. What we're playing with, what we prototyped out to finish this up for CSCW, is
collaborative filtering. So the idea is that if people actually find this information valuable, we
can also use that to rank them, and you can imagine constructing more meta information out of
this, so maybe as people read discussions or as people close discussions, we can construct
summary information that's more condensed and more relevant. I didn't -- yes.
>>: So once you're established, then people can go and rate.
>> Jeff Rzeszotarski: Kind of in a post hoc way. Also, I haven't really even talked about
temporal issues, which are also an interesting question. Do things decay over time or stay
consistent? And scale -- Wikipedia pages, singularly, like I've been discussing now, or all of
Wikipedia as one model? So this is one particular final vision. This is a prototype we're
thinking about right now. You can see we're giving people real-time editor feedback in the left
bar unobtrusively. They get more information about what they may or may not want to add. So
that's thinking about context in Wikipedia as the historical context hidden beneath the page.
How can we expose that to people in a tractable way so that they can make sense of data, and I'm
using ML and Vis to help get us to that point. Switching gears to crowdsourcing, I think
everyone in this room is pretty familiar with existing crowdsourcing marketplaces, including
Mechanical Turk, which is more micro, and Upwork, which is more contractor organized. In
short, crowdsourcing workflows kind of follow this pattern. Imagine I've got a big corpus of
images -- in this case, adorable puppies -- and I want to tag each of these. I could one by one go
through each animal and tag it, or I could give each picture to a single person in parallel and have
them all do it. This holds a really interesting possibility for getting a lot of human judgments
really quickly and scalably. The challenge is, not all results that you get are good, especially
when economic motivations, like in the case of Mechanical Turk, start to come into play. When
people are extrinsically motivated, they may try and find ways to game the system. So I asked
people to tag that image, and you can imagine getting really eager answers, but also answers like these -- we
asked them for three to five tags, and they gave us three, or no tags at all, hoping we wouldn't notice
that they didn't give us any work. This is good for them, because these people are then making the highest
possible hourly wage for the least amount of effort and hoping we won't notice. Some
Mechanical Turk workers call this cheating, the idea that they're cheating you out of particular
value by not contributing. And if you look at Michael Bernstein back in 2010, the find, fix,
verify paper, they pegged it at 30% of submissions are of that so low quality you can't even use
them. These days, 10% to 30% is about the rule you should use. So you think, then, we've got
to figure out which one -- based on each submission you got, is it this, this or this. And if you
look at the existing Mechanical Turk interface, this is what you find. I asked them to help name
my company, so each row here is a list of company ideas brainstormed by a Mechanical Turker.
You notice I can get the names they've given me, so the raw data they gave, their approval rate -- in other words, have I kind of reputation system-wise approved them before -- and if I dig really
deep into this interface, I can also see whether they worked for a long time or a short time, but
that number is known to be incredibly unreliable. So how do you find the good work? Existing
research has looked at this through two different lenses. One is designing better tasks, so people have to
give you good work, which is really hard. This usually involves a lot of iteration and a lot of
incentive design. You can also in a post hoc way analyze what you've got and try and find the
good stuff. So one way to do that is by seeding your task with gold-standard questions. So the
idea is if you already know the answer to some questions, you can put them into your tasks and
see which workers get all of those right, because then obviously you trust their results more. If I
asked you, though, what's the gold-standard example for a poem, what sort of restaurant review
would tell you whether they're a good or bad worker -- for complex work, this all kind of falls apart.
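For contrast, gold-standard screening for objective questions is easy to sketch; all the names here are hypothetical. The point is that there is no analogous check you can write for a poem or a restaurant review.

```python
# Minimal sketch of gold-standard screening, assuming per-worker answers and a
# dictionary of known answers for the seeded questions (all names hypothetical).

def screen_workers(submissions, gold_answers, min_accuracy=0.8):
    """submissions: {worker_id: {question_id: answer}};
       gold_answers: {question_id: correct_answer} for the seeded questions."""
    trusted = set()
    for worker, answers in submissions.items():
        graded = [answers.get(q) == a for q, a in gold_answers.items() if q in answers]
        if graded and sum(graded) / len(graded) >= min_accuracy:
            trusted.add(worker)
    return trusted
```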
It's hard to understand more creative or more varied inputs. Also, I might add that workers are
known to game this. CrowdFlower in particular has been known to have pools of workers who
learn the gold-standard formula and only answer those properly. You could also have multiple
workers redundantly do the same task. For instance, if you're transcribing a video, you can just
pick the most common answer or most common substrings to get a decent transcription. Of
course, there's no most common short story, and having six people do the task of one person adds
a lot of redundancy into your system and can crunch down a lot of the diversity that human
judgment is valued for. So in this line of work, I propose a really different signal for evaluating
the quality of work in a crowdsourcing workflow, and even just understanding in general how a
crowdsourcing workflow is going. And that is thinking about the middle, between designing a
task and getting your results. The way workers work can tell you a lot about not only their
performance, but the task and workflow in general. And so to give you an example of what that
looks like, let's consider two workers taking an ACT practice test. So you read a passage, reply
to some multiple choice questions. Worker A accepts the task, scrolls down, clicks an answer,
clicks an answer, is done. Worker B accepts the task, pauses here, scrolls down, pauses here,
scrolls up, scrolls up, pauses here, clicks an answer, pauses, clicks an answer, is done. And I'd
ask you, not even knowing what the answers to these multiple choice questions are, which worker
did a better job? You'd probably say B. It's not a super-hard question, unlike these questions
right here, which are pretty difficult. You see brouhaha here. That's pretty tough. We associate
Worker B's behavior with diligence, right? Those delays were them checking the passage, and
our knowledge of the task actually informs a lot about their end performance. So we constructed
a model that measured workers' work behavior while they worked using clickstream data. So
here's a worker typing, submitting hello, really low-level events. We had a bunch of workers
complete different tasks, and from those low-level event strings, we extracted a bunch of general
features that were more comparable across workers and more quantitative, so things like how
long they spent, how much did they pause to think while they were typing, those sorts of
behavioral features. We had workers do three different kinds of tasks, pick the nouns, tag an
image or that practice test you saw earlier. And in practice, just looking at the way people work
can really inform -- give us information about their end product. So calling out image tagging,
we had two raters rate whether they thought the person was cheating us, or in other words, giving
us bad results intentionally, and our model just looking at behavior got 93% accuracy in terms of
whether a person was cheating or not, based solely on behavior. And we had two raters also rate
quality on a five-point Likert scale for those tags. Our model just looking at behavior can get
within about 0.5 on a five-point Likert scale of human ratings, just looking at behavior, not even
considering the end tags. So this is really interesting. This gives us the idea that behavior is a
really valuable signal for understanding quality of work. But it also neglects to consider a lot of
really interesting features. So right now, we're crunching behavior down into just a simple
outcome measure, pass/fail or rating. What about individual variability? What about different
ways of working or different cognitive strategies? So building on this work, in the second
paper, I started to build a visual metaphor for these sorts of traces of activity. So here, the blue tick
marks are people clicking on something. The orange lines are people scrolling up and down the
page. The red boxes are people typing. You can see this person paused in the middle of typing,
and black lines, which you won't see here, are changing focus or tabbing to a new tab. The idea
here is now we can actually go a bit deeper than just good or bad, so I'm going to ask, who did the
poor work? But now, why? You probably would say A -- these are people doing the ACT
practice question -- but A didn't do the poor work just because they had a shorter trace. In fact, time
spent on task is usually a poor indicator of performance. Instead, what you see in B and C are
this upside down V pattern. These are people who are reading the question and then checking
the passage for the appropriate answer, then scrolling back down again to pick their answer.
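These timelines are drawn from low-level event traces like the ones behind the earlier behavioral model. Here is a sketch of turning such a trace into comparable features (total time, pauses, scrolling, typing, focus changes); the event names and the pause threshold are illustrative assumptions, not the study's exact feature set.

```python
PAUSE_THRESHOLD = 2.0  # seconds of inactivity counted as "pausing to think" (assumed)

def behavioral_features(events):
    """events: list of (timestamp_seconds, event_type) tuples sorted by time,
       with event_type in e.g. {"click", "scroll", "keypress", "focus_change"}."""
    if not events:
        return {}
    times = [t for t, _ in events]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "total_time": times[-1] - times[0],
        "num_events": len(events),
        "num_pauses": sum(g >= PAUSE_THRESHOLD for g in gaps),
        "total_pause_time": sum(g for g in gaps if g >= PAUSE_THRESHOLD),
        "scroll_events": sum(e == "scroll" for _, e in events),
        "keypress_events": sum(e == "keypress" for _, e in events),
        "focus_changes": sum(e == "focus_change" for _, e in events),
    }
```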
We're now actually getting to their behavior, in other words. This visual metaphor lets us
not only understand whether they did good work but also start to explain theoretically why that may be.
We can find out some really interesting things by looking at worker behavior. We asked them to
tell us their favorite color and use a color picker to pick it and then describe it to us. There's a
perfect, or a near-perfect correlation between the delay they spent before they told us the color
name and the length of the color name they gave us, because they were picking the perfect shade in the
color picker. More operationally, we asked people to translate from Japanese to English a
particular passage, and one of these is not like the others. Only one of these has red blocks
indicating typing, so only one of our workers, in what I believe was a 10 or 20-worker pool,
actually did any typing while they were completing this translation task. Everyone else used a
translation service, and so the most common answer to this translation passage is this one, which
if you can read Japanese does not match all that closely, and if you can read English, you can tell
is not terribly sensical. This is the worker in green, the one who stood out in the behavioral
traces. It turns out, they actually used machine translation, as well, but they took effort to
proofread and correct the machine translation before they gave it to us. So this still wasn't a
perfect result, but we were able to pick the best result possible, which was not the most common
one. These timelines were part of a much bigger interface that lets you triangulate down on the
relationship between the output workers gave you and a number of different representations,
their behavior and some quantitative features, like their time, how much thinking they were
doing and things like that. The idea here is that you can pick a few interesting behavioral traces
out and use distance and ranking measures, then use small ML algorithms to pick out more
people like them, so you can iteratively build an understanding of different kinds of strategies or
different kind of working habits among your workers. Yes.
>>: Question. This is [Bob Frankel]. What's your sense on how this would generalize? For
example, if you provided an API and a schema, so people could just fire data at you
from any domain and just say, look, here are those different metrics, operationalized by me, on
writing an email or doing whatever task? What's your sense? Do you think it would work?
>> Jeff Rzeszotarski: Yes, so A, we don't know, and this is something really interesting to me, is
kind of getting beyond just performance to thinking about, are they ESL? Do they have domain
expertise? Are they checking their email well or not? Things beyond just this pure labor pool
performance. In the initial ML work, we had really good results porting the model, so we could
take a model trained on one task and actually accurately predict whether a person
would pass that reading comprehension test. So it points to maybe there being archetypes for
different kinds of tasks, but a lot left needs to be done. You can think of similar behavioral or
interaction patterns between multiple classes of tasks. It's most certainly task directed as well as
application directed, so there's probably some interaction of the two. I would really like to
investigate further, in the lens past just pure quality. Additionally, you can imagine giving this
right back to the contractors working, so can we tell contractors, we noticed you have this skill.
Maybe you could find tasks that are more aligned with this that deliver more value. Or maybe
we notice here you're getting fatigued. Why don't you try something fun? Here are some
suggestions. So giving power back to the contractor through self-awareness. Similarly, we can
actually give organizers a much better picture of what's happening, so in that ACT practice test,
you can imagine telling organizers, hey, I noticed that the good people were checking their
answers. Could you make your task design such that people check their answers by design?
There's a really nice, fruitful cycle where you learn from the different strategies workers are
employing on the fly as you develop better and better versions of tasks. So context in
crowdsourcing markets really means discovering human behavior as people complete a task and
surfacing that to task organizers, so that they can make a better adjudication about the
performance or nature of their working pool. So once again, it's giving them the information
necessary to understand and then act. In the last portion of the talk, I'm going to briefly touch
upon a more general approach, which is helping people see more context in their own
multivariate data. So multivariate data is something that I think we're all pretty common or have
a good amount of experience with. Each row here is a brand of cereal. Each column is nutrition
information in the back, and you can see, if I asked you, find the correlation between
carbohydrates and sugar, you may intuitively know there is one, but going number by number
may be too difficult, if you're just starting to look. So to get a better visual understanding of the
multivariate data, researchers have gone in the direction of visualization techniques. So here's an
early example of Film Finder, which charts out films on kind of a temporal and a rating axis.
The neat innovation here is that you may not be able to see -- you may not want to see every
single film on that chart. You may only have certain interest areas, and so we can use dynamic
querying, these sliders over here, to filter down what you're looking for, and the stuff
accordingly pops in or out of the screen. What if you want to see more than two dimensions at
once without scrubbing that slider to see? We can instead stack charts, so now we have three
different dimensions of data showing in these stacked charts, and we can use brushing to help
zero in, because the attentional load for trying to find certain regions is quite high. If these
weren't colored, it would be hard to figure out where the clusters lie, at least certainly the green
and orange. So the attentional load is high. We can also use really advanced visualization
techniques like parallel coordinates, which are really effective at seeing trends where values
change abruptly. So each point is a row now, and it crosses these vertical lines on its values, so
you can see that the orange tend to go all down, the blue tend to go all up, but if you were
untrained, this could be pretty overwhelming, especially if you weren't an experienced analyst.
And if these weren't colored, would you necessarily be able to see it through all the noise? So
the core issues I'm identifying in these sorts of approaches are what I'm pointing to as limitations.
These approaches certainly are incredibly valid and work well. I've used them. Hard constraints
can make it hard to track values as they change over time, so as I move those sliders in Film
Finder, stuff appears and disappears. Training can be a serious issue. And also, they can be high
load. Those stacked plots are really hard to interpret at times. So interestingly, a really
wonderful thing has come up in the past decade, where touch devices have become not only
common but incredibly used by everyday people. Everyone owns a tablet or smartphone in
America these days, at least a high proportion, which is a shocker. And these devices have a
really nice property. They bind interaction really closely with response. They occupy the
physical space of a person, and they also afford a really interesting potential in terms of
naturalistic visualization systems. So I'm defining these as systems which employ interactive or
visual affordances that resemble real-world phenomena. And the idea is, we can use touch and
these natural feeling systems to get really close to users. We're leveraging their inherent
expertise, so I know if I drop this, gravity will pull it to the table. And even further, this is a fluid
thing, so it doesn't just pop from here to here. It actually fluidly transitions all the way down as
part of its fall. These sorts of interfaces encourage a lot of experimentation and play. If you
think about using a tablet, it's generally a playful experience. And so in this line of work, in the
Kinetica project, I asked the question, what if we used physics-based approaches to help people
explore multivariate data, leveraging this idea of fluidity, requiring little training because
people already know the models, and using physics metaphors as applied to actual data
processing. So to give you an idea of what I mean by a physics metaphor, here's one. This is a
kitchen sieve. This is actually a really great filter for data. So not only do you see the particles
that pass through the filter, the small cornmeal, you also see the stuff that didn't make it, and you
encode in the process of shaking this filter out the act of filtering. It has really nice properties in
terms of amount on either end and the action. So you can see here in this video, I'm doing the
same thing to data now. Some things pass through. Other things don't, and I see both ends of the
filter. And I recall it and encode it because it's an action I take. We can use magnetism to pull
points of the charts, and we can emergently combine different physics-based tools together to get
really complex data interactions. And so here we're filtering out some points, charting them and
then highlighting some that match a criterion. To understand where and how these particular
physics-based Vis approaches are good, we conducted a small between-subjects user study,
comparing Excel to this new approach. Participants first received training in either case. We
thought these conditions balanced out. Participants tended to have more Excel experience, but
Excel is comparably harder to use, so we thought there was some leveling going on, because
Kinetica's training cost was much lower. Once participants were trained, they were given some
basic stats questions to make sure they understood the technology. Then, they completed two
different tasks. The first was, here's some data, find the perfect car for you. Here's some
example criteria to go on. The second task was a set of people who were on the Titanic when it
sank. We gave users an open-ended exploration task -- find out as much as you can in this
data set. Here's an example of what one participant found in the car-buying task. We asked
them, what are you looking for in a car, and they immediately pushed all of the points out that
were hatchbacks. Apparently, they really did not like hatchbacks, but you can still see them
here. They graphed by weight, because they had a hypothesis that heavier cars do better in the
winter, and you can see they encoded this three-dimensional sort to capture the distribution of
power versus fuel economy and filtering out based on their budget. Interestingly, the participant
did not just go and say, you know, this is the optimal one. It has the best mileage, or that
Porsche up there is the best one because it's the most powerful. They gauged the bulginess of
this distribution and said, I really want something more like that, because that point's in the
middle of the road in a lot of different features I care about. It spoke to their deeper
understanding, as opposed to just matching the criteria optimally and perfectly that they started
with. Here's an example of a participant in the Titanic condition, and they're actually doing a
four-dimensional query here, so points are being pulled to a particular place in the chart. In this
case, we have cabin class and gender of passenger. They're colored by survival, so the people
who died are red, the people who lived are blue. They already noted that more women survived
than men. I'm sorry, this is kind of a macabre example. But they were interested in what about
the children on the boat, and so they drew a barrier that excluded the children from the set, pushed
them out, but because they still hold their place in the chart -- this is the kind of consistent
physics metaphor -- we get clusters. So you can see in the lower-left corner, there's a solo red
dot. It's the only girl in this data set to die, and similarly, there's a solo red dot among the women
in first class. This is a mother and daughter that this person was able to find because they're
outliers on this four-dimensional split. They wouldn't have otherwise noticed it. I might add, a
lot of our participants -- this was kind of a more ecologically valid sample, so we did not
have a lot of college students. A bunch of our participants had never even picked up a tablet
before and were still able to do this sort of task very quickly. Looking at participants' findings,
in general, Excel users excelled at these two types of findings. We coded them with two
different raters with pretty high reliability. Point findings are things like, this particular data point is aged
40. Statistical findings are things like, the average age is 50.1. Whereas Kinetica users were much more able to
do comparative things: more women survived than men, there's a relationship
between age and survival, older people tended to die more often in the data set. And descriptive
things like there just seemed to be more men than women in this data set. This speaks to a more
holistic and general understanding, while they could not necessarily get down to quantitative
features. Going all the way back to the old Perer and Shneiderman paper I mentioned about
broad exploration moving into statistical tools, you can imagine this being a sort of wayfinding,
where you identify interesting areas to further interrogate using quantitative means. Since
Kinetica, I've commercialized the technology as DataSquid, which has been really great, because
it's allowed me access to data stakeholders, people who actually use data in their everyday lives,
and I can go into detail with this more with you later. But this also led to a redesign of
interactions, so this is what DataSquid looks like now. And you can see, we focused in on giving
people a plot at all times, because we realized the core benefit this was providing to people was in
terms of varying different representations. Context in the case of Kinetica/DataSquid means
giving people as many different views of the same data as possible, so they can build a better
model and notice more interesting trends, and doing this in a way that forces them to see
statistical features like distribution. You can see how these bulge in different ways and have
different centers without a box-and-whisker plot. In the future, I think there are some really
interesting possibilities. How do we show a million points in a small screen in a way that's
sensible to inexperienced users? And how do we represent the fact that if we're clustering
100,000 points together, that there's a stochastic quality, there's an uncertain quality to those
points? There is no perfect average for those 100,000 points. How do we
devote detail on the screen, devote more pixels to the parts that actually matter, and fewer to parts where
we know the user may not be interested? Additionally, I think presentation and sharing are really
crucial in this sort of information visualization approach. All of our stakeholders for DataSquid
want to share this with others immediately, and I think this has really interesting potential. How
do we help people curate data presentations in a meaningful way while they explore or right after
they explore? What sorts of things are people choosing to present? What aren't they presenting,
and what's the best medium to share? Do we demand our users do this live in person and speak
it aloud? Can we generate static visualizations that we pass on to other people after the fact?
What modality works best for conveying information? And of course, what does physics look
like for correlation? What sorts of complex analyses have metaphors in the easy-to-use physical
world, and what do those look like? So context in this case means context about your own data.
How do we help give you detailed representations of a data set such that you can find interesting
features and build an understanding to direct your decisionmaking? In particular, we find that
this DataSquid tool is really good for Yelp data, helping people pick their favorite restaurants,
because of this underconstrained, I don't even know what I'm looking for quality. One theme
that's emerging out of all my work moving forward is this idea of use -- how do we move from
seeing to employing or using and acting? So I propose this creates a really virtuous cycle. If we
work with stakeholders to understand data and develop new data visualization solutions, we can
improve the kinds of contributions on Wikipedia, or the kinds of findings people make with data,
which in turn gives us better data to generate new systems. There's a really virtuous cycle
inherent in this process. In Wikipedia, you can imagine being prescriptive, telling people, here's
an interesting area to contribute. Here are some things to consider if you want to contribute up
there. If you're breaking norms, these are the people you should probably talk to before you do.
We can do this in more communities, such as web forums. You can imagine capturing what is a
flame war and modeling that. On open-source projects, we can capture, what makes a good issue
request? Who is contributing to a certain part of the code, and what do you need to know in
order to make a good contribution there? In crowdsourcing, you can imagine moving towards
being prescriptive to task organizers. Here's how you should redesign your task. Here are some
stakeholders in the crowdsourcing market that would be really good to talk to or have domain
expertise that would be really well suited to your particular project. And we can expand this to
contractor or creative-type markets. Identifying expertise becomes a really critical concern as
the work gets bigger, and understanding the kind of work and the process of work becomes
increasingly critical. In the multivariate models, as I mentioned before, scaling up and thinking
about presentation is important, but also what does a physics-based visualization tool look like
for graph data? How can we extend this sort of approach towards naturalism to new data types
and keep people rigorous so they avoid the problem of running so many t-tests that one always comes out significant?
How do we make sure they have an adequate understanding of statistical reliability, whether or not
they're an experienced user? So with that, I'd like to thank you all very much for your time, and
thanks all for hosting me. I'd love any questions. Yes, Erin.
>>: So at the beginning, when you were talking, you had an example with Zillow as being
hard to use for certain kinds of things. I was just wondering if you had had any thoughts on that
in particular in terms of what the solution might be for that kind of --
>> Jeff Rzeszotarski: Yes, so we're actually running a between-subjects study right now,
evaluating kind of a Zillow-type home buying task between a traditional interface like that,
something more involved like Tableau and the Kinetica prototype. The idea here is that if you
don't necessarily know what features you're looking for, because DataSquid/Kinetica shows you
a bunch of different representations really easily, it can help you triangulate on breakpoints in the
data. Maybe neighborhood actually is a feature I care about, because it really cleanly breaks the
data into want and don't want. Another area we're looking at with the Zillow task is annotation,
right? If you're doing a lot of different analysis steps and different representations, giving you an
ability to carry through information from each of those different representations is important. So
maybe I tag things that are cheap and in a nice neighborhood with a red color and then go
through and say, well, you know, these are decent parking areas on the geographic view, tagged
as the blue color, and at the end, be able to collapse your information down. I think in general,
the only way we're actually going to be able to zero in on what makes a proper kind of Zillow or
customer decisionmaking tool work well is by running a bunch of A/B and exploratory studies
like what I'm doing now, trying to get at, piece by piece, part of the data by part of the data,
where is the benefit coming from? And even in Kinetica, interaction by interaction, why is the
tool performing better than existing ones? Yes.
>>: You mentioned a little bit at the end there about scaling up the size of the data. Are there
limits in your mind, and if so, what are they, do you think, for both the sort of physical
interaction and analog, like in Kinetica, but also the cognitive analog, like in the first stuff, where
you're saying we can pull out topics and things that map to people's mental models. But do you
think that millions or billions of data points -- where does it break down?
>> Jeff Rzeszotarski: So something that's kind of lurking behind the scenes in the Kinetica slide
I showed is that data are not often perfect or even good quality. Data are noisy. Data come from
varied sources and need to be brought together, and that problem only magnifies, the larger in
size you go. So I think one way to start tackling that problem is to think about machine
learning and aggregation. How do we aggregate points into collections in a way that makes
sense? One early prototype I did while ideating about this for Kinetica was to apply a
hierarchical clustering based on data features to the points, so the idea is that we can bring all the
points together or we can selectively break them apart if we know a user is interested in a certain
set of points. The challenge in all these approaches, I think, is maintaining focus and context.
How do we show the user a lot of detail about what they care about but still represent the other
stuff in a faithful way that leads them to contextualize what they focus on? And that's where I
think we may hit limits. So if you have a million points on the screen, the focal area may only be
10 points, and there may be 999,990 points being condensed in some way. We may not be able
to condense those in a meaningful way, especially if the data are noisy. And so another way to
take the work is to also think about other modalities, like natural language querying. How do we
help people interrogate the data not just through this digital medium but also through querying it
properly in their own language, reciting results back when it makes more sense to be quantitative
than visual, mixed-media systems, almost. I wish I had an immediate answer, but a lot of my
work in this sort of problem space has been around conducting design ideations, building
prototypes and testing to explore these issues, because I've found that I've learned the most by
just constructing systems that start to do this, that raise those salient issues to the top, learning by
building. Yes.
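To make the hierarchical aggregation idea above a bit more concrete, here is a minimal sketch that clusters points by their data features and then selectively expands only the clusters inside a hypothetical focus region. It uses SciPy's linkage and fcluster, the feature columns are invented, and it illustrates the general approach rather than Kinetica's actual prototype.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def aggregate_points(features, n_coarse=20):
    # Build one cluster hierarchy over the data features and cut it coarsely.
    Z = linkage(features, method="ward")
    coarse = fcluster(Z, t=n_coarse, criterion="maxclust")
    return Z, coarse

def expand_focus(Z, coarse, focus_mask, n_fine=200):
    # Re-cut the same tree more finely, but only keep the finer labels for focused points.
    fine = fcluster(Z, t=n_fine, criterion="maxclust")
    labels = coarse.copy()
    labels[focus_mask] = fine[focus_mask] + coarse.max()   # offset so ids don't collide
    return labels

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 4))     # hypothetical columns: price, beds, baths, sqft
Z, coarse = aggregate_points(features)
focus = features[:, 0] > 2.0              # pretend the user has zoomed in on this region
labels = expand_focus(Z, coarse, focus)
print(len(np.unique(labels)), "visible groups instead of", len(features), "raw points")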
>>: Sort of a follow up to your last point, you focused a lot on exploratory tasks, but how do you
think about balancing a broad range of tasks, so on a real estate site, it may be that what people
want to do is monitor the price of a property or look in one particular neighborhood.
>> Jeff Rzeszotarski: Yes.
>>: And in search, people do a lot of very simple things, so how do you balance that with the
broader complexity, and how do you get people to go smoothly from one to the other?
>> Jeff Rzeszotarski: Yes, and this is even true in Wikipedia, where an administrator wants to
track changes over time. So yes, I think the talk is focused a lot on exploration. Those are kind
of the interfaces that I've largely focused on. And the challenge comes I think in changing
context, like you were saying. We can develop a prototype that really adequately helps people
do directed search or keep track of prices over time, but it's the transitional moment that's the
really hard part. So I've actually been prototyping time series data in the Kinetica line of work,
and so how do you transition from looking at one single window of time to data mapped over
time? And it turns out the transitional point is really, really, really hard to get right, because we
want people to maintain a sense of consistency, right? You were looking at one window. Now, here's that
window in a larger context. And doing that with the kind of fluidity that users expect in this type
of interface has proven extraordinarily difficult. Have the points moved based on their time such
that they're consistent? Do we show kind of multiple representations repeated? In some ways,
they're domain specific, so I think the tools that I would use to support a Wikipedia administrator
aren't the same as what I would use here to show data over time in Kinetica. But that issue of
consistency and learning what kind of tasks they're doing, whether it's actively or passively, I
think are the core questions. Whether you have the user declare that or we try and infer it from
their pattern of use, using the crowdsourcing work that I've done already on behavioral
monitoring, that's a super-open question, really exciting. Yes, Erin.
>>: [Indiscernible] like Kinetica and DataSquid, what do you feel is the capacity of a typical
adult who's not a PhD in computer science to understand -- to have the right skills to understand
how to interact with those, and do you feel that there's any need for more research or tools or curricula
on how to educate people to interact with data?
>> Jeff Rzeszotarski: I think this is a tension in the Kinetica work, where we want to give this to
very inexperienced users because we know they can begin to use it quickly. But the question is
whether they can use it rigorously or not, and how much statistical education they need before
they're appropriately prepared to make findings using it. And that's a tough one. On one hand,
one of the design principles behind Kinetica is trying to make the affordances we use for
exploring data push you in the direction of statistical validity. So surfacing distribution at every
step of the way through the way the points bunch up, showing filtering visually so you know
how much is being filtered out, so you don't focus on two points when there are a couple hundred
that you're excluding now. I think there are some visual and interaction ways to keep people
rigorous, but that's still not enough. We can get a 45-year-old participant who has never touched
a tablet before using this in five minutes, and they can pick out a house for themselves, but is it
ethical to have them look at something for five minutes and find data points that correspond to
such a big decision and not necessarily understand the ramifications because visualization can
read as so authoritative? Jaime and I were talking earlier about the difference between text and
visual, and how people are pretty well primed at this point to deal with text resources, ranked
search results, and know that the top may not actually be the top. That's not necessarily true
when everyday people interact with visual systems like this, and so part of it is kind of tamping
expectations down, saying, yes, this is one way to look at the data, but this may not be the full
way to do it. And I'm not really sure yet what that looks like. Do we design more systematic
things to say, hey, wait a minute, you've got to check these things before we actually go through
with any decision? Do we just adjust the interface so it won't show you things if it's uncertain?
It's a whole continuum that I'm not quite sure about. This is a problem even in Wikipedia, where
people may not be able to interpret the syntax that people use in discussions when they're
negotiating down the page. If they reference WP:Peacock, do you know what that is? It's
actually a policy that says don't use exaggerative words, but how do we make sure that we level
these things properly?
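One rough sketch of the "show what filtering hides" principle mentioned in that answer: every filter step returns and reports the excluded rows instead of silently dropping them, so an interface can render the hidden mass as muted context. The DataFrame columns and function name here are illustrative assumptions, not Kinetica's API.

import pandas as pd

def filter_with_context(df, mask, label):
    # Apply a filter but keep and report what it excluded, rather than dropping it silently.
    kept, excluded = df[mask], df[~mask]
    print(f"{label}: kept {len(kept)} of {len(df)} rows; {len(excluded)} hidden but still drawn as context")
    return kept, excluded

homes = pd.DataFrame({
    "price":        [250_000, 410_000, 320_000, 275_000, 600_000, 150_000],
    "neighborhood": ["A", "B", "A", "C", "B", "C"],
})
visible, hidden = filter_with_context(homes, homes["price"] < 300_000, "price < 300k")
# A Kinetica-style view would keep `hidden` on screen in a muted form, so the user
# never mistakes the three remaining points for the whole dataset.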
>>: [Indiscernible] the points to stay on the screen even if -->> Jeff Rzeszotarski: I see money there, but that is real -->>: I said get that thing out of there.
>> Jeff Rzeszotarski: But I think that's the sort of tension this work sits in. And I don't have an
easy answer for it, because I think it's a really hard design as well as systematic question. Yes.
>>: Do you know, are there any good data -- directive data exploration tools that are geared
towards kids, for example, as a -->> Jeff Rzeszotarski: That's an area that I'm not super familiar with. There undoubtedly are, and
I really need to look into that to understand the issues I think you're talking about. Because I
would assume that's the place where you start.
>>: Here, that was interesting, too, because -- so I know one of the things people often say about
Wikipedia is like, oh, kids shouldn't -- a lot of schools have rules like, kids shouldn't cite
Wikipedia as a source in their reports for school, because it's not authoritative or something,
right? And in Danah Boyd's book that she wrote last year about teens' use of social media, one of
the chapters had an aside about Wikipedia, which for some reason wasn't really germane to the
main point of the book, but I actually thought it was one of the best explanations I've ever read
for why teachers should let students cite Wikipedia as a source in a paper. And it
was focused on the fact that all of the background parts of Wikipedia that people don't normally
read are actually really educationally informative for helping students, young kids like teens,
understand the nuances and subtleties of both the credibility of information and how it's
generated over time and what is and isn't controversial, and that incorporating all of Wikipedia
and not just the surface features could actually be really important, educationally. And I wonder,
seeing your system makes me think about -- your system is designed for adults who are
contributing to Wikipedia, but I wonder if you know of or have any thoughts about tools that
would actually help middle or high school students who are consumers of Wikipedia to be more
informed consumers of some of this background content in a way that would enhance their
education.
>> Jeff Rzeszotarski: Training their internal filter better, if nothing else. I'm not familiar, in the
Wikipedia context, with any systems like that, but that's a really interesting angle to take this. And
then the question becomes how do you surface the right -- because I think the contextual
information you want is then a little bit different. You may want to bias it towards successful
interactions, as opposed to people fighting and coming to no resolution, without any other
progress, or niggling over a very tiny detail and not an architecturally important part of the page.
This is something I really had not gotten into much in the talk, but curation underlies a lot of this
work in the sense of what information you choose to present and why, because inherent in
constructing these topic models are features that raise or lower certain parts of the discussion
versus certain parts of the page, and inherent in the crowdsourcing work is the question of which
behavioral features we surface and why. And that really influences the end
conclusions people make, and I think that's dictated a lot by who the perceived audience may be.
It's an area I have not explored much theoretically, and it sounds really, really interesting to dig
into. Awesome. Thank you, and thanks, digital people.