>> Jaime Teevan: All right, so welcome. Thanks for being here, and thanks for tuning in, also. It's my pleasure to introduce Jeff Rzeszotarski. Jeff is currently finishing up his PhD at CMU, where he's advised by Niki Kittur. He is an HCI person who uses InfoVis solutions to solve problems in social computing and crowdsourcing, and Jeff has built some really interesting systems, including what is now known as DataSquid, and published a lot of great work at CHI, CSCW, UIST, HComp and received a handful of Best Papers, as well, from CHI and UIST. Jeff is familiar around the lab, as he did an internship with Mary Morris a few years ago, and he's also an MSR Fellowship winner, and I know I and others here have cited his work in our papers. More recently, he also received the Carnegie Mellon Student Innovation Fellowship, so I'm looking forward to his talk today on what he learned about helping people make sense of data. And I also thought I'd mention for those of you online, I have the question-asking terminal, so feel free to ask questions, and I can say them out loud. >> Jeff Rzeszotarski: Awesome. Okay, hi, everyone. Jeff Rzeszotarski. So I'm just finishing my PhD now at Carnegie Mellon, and like Jaime said, I kind of work at the intersection of data visualization, social computing and crowdsourcing, and it makes sense to start my talk, I think, with data. So here are some interesting data sources that are pretty darn huge. So Wikipedia just recently passed 5 million articles in size. Mechanical Turk as I sampled it a couple of weeks ago has over 200,000 different tasks for workers to do, presumably organized by many people. And even on a consumer site like Zillow, we've got over 2.5 million data points, or in this case, houses you could purchase. And the thing about these massive data stores is, we're actually using them in our everyday lives in a really meaningful way. So Wikipedia, if you're at all like me, you incorporate that in your everyday knowledge. Rarely a day goes by that I don't use Wikipedia. Mechanical Turk is used in products that face us -- quality assurance, search, translation approval, corpus building, all sorts of things in the back end. And even a site like Zillow, it's ending up being used for one of the most important decisions in a person's life, buying a house. This is serious economic impact. The problem is, these aren't necessarily perfect. Wikipedia has an overwhelming amount of historical data. If you think of a Wikipedia article as sort of like an iceberg, you can see the article on top, but there's gigabytes of historical data underneath it, collaboration by thousands of people that's hidden from you. Mechanical Turk, if you're organizing one of those 200,000 different tasks, you get a stream of raw results back, and it's not necessarily obvious which ones are good, which ones are bad, how they all contextualize together. And even a consumer site like Zillow, it's rare that a person making a decision exploring data has a perfect criterion or makes the perfect optimal choice. It's always an involved, exploratory process among a lot of data. So the common thread among all these different domains is that there's really no one-size-fits-all solution here. We need people to understand the data, understand context, be able to make an informed evaluation of history or which purchase or which house to buy or which Mechanical Turk task is the good one. But there's no real easy way to do this.
And so what I point to in my line of work is the idea that we need to support data exploration. We need to help people not only find the thing that matches their expectations but in fact understand what their expectations are in the first place. And so my core guiding principle in my work is really how do we design systems that help users see and use context in complex data? And this idea of context is really important. If you think about when you're making a decision or trying to make sense of data, you can't just come into the data source and know immediately what it is you need to see and why. You need to see a bunch of examples, a bunch of counterexamples, to build up a model, to build up an understanding. And so in my work, I focus on a number of different domains, identifying how we can surface context to help people complete complex tasks and make sense of data. This might mean that people can perform more quickly or more effectively. They may be able to explore more features in the data at once, or their findings might be better. They may have more satisfaction after exploring this data. At this time, I want to draw a distinction here between directed and exploratory tasks. So what I call directed data tasks are ones where you kind of already know what you're looking for. If you think about Google Search, you have some search terms. It is well able to fulfill those requirements. If you have discrete criteria, systems right now are very effective at giving you exactly what it is you're looking for, whereas exploratory tasks are much harder to afford. If you think about trying to explore, it involves building a mental model. You don't come in with a perfect representation. You build that representation from the ground up. And only then once you've built a model can you generate insights, which are different from a perfect decision matching criteria, and your decisions themselves are kind of integrative processes. It's not this matches, I'm done. To link this a little bit to literature, we can consider what cognitive science has done with sensemaking. So sensemaking is the process of constructing meaning from information encoded in data. And Weick I think describes this really nicely as an iterative process. You're developing a mental model, but you don't come into the data with it immediately. You iteratively build it up over time and over exploring. Pirolli and Card have a really evocative term, which is foraging. You're searching around, for example, so you're searching for necessary data to build up an understanding. And Perer and Shneiderman add a really nice complication to this, which is all of us in this room are familiar with using statistical tools, kind of specific analytics, but what they found is that when you're making sense of data, very often, you have to start with a broad exploration before you can even apply those statistical tools. So you can imagine a narrowing, where you start with an exploration, where you need to make sense of data iteratively, and only after a while can you actually apply statistical techniques. To illustrate this even further, let's look at a sensemaking model, so this is Russell et al.'s model, which I think is pretty effective in conveying what sensemaking's iteration is all about. Imagine you have some data and you come into this task with a little bit of understanding about the data. I maybe have a task that's telling me what to look for already. In the green box, you start searching the data for good representations. 
In other words, you're trying to find data that match what it is you think the data is all about. So if you have an example, if you're looking for a house to buy, you may already have some existing criteria that you can find houses that match that ideal, as you're trying to make sense of what's out there. In the blue box, you take those examples, those representations, and you encode them into your mental model, so in this box, what you're really doing is taking those examples and asking yourself, do they fit in my model? How do they fit? Do these match my understanding well, and if so, that's great. But if not, you end up with residue, that red arrow. So as you explore data, you're not going to find every single example fits your mental model. In fact, some things don't fit your mental model, and as you explore more and more data, you build up this residue, stuff that doesn't fit. Eventually, you hit a crisis point where you have to adjust. You have to change your thinking about the data so that you can accommodate this residue. And so if you think about this process more holistically, we have iteration where you're trying to find examples. You try to find counterexamples that are signals that you need to switch your understanding, and over time, you accumulate less residue, because your model is getting better and better. This points to a couple of different ways that we can improve sensemaking for data explorers. In particular, we could help them find those representations better, direct them to ones that either confirm or disprove their hypothesis fairly quickly. Similarly, we could also improve the iteration of the process, so how much data can you cover in a given loop of this process? This can help people make better explorations and develop better mental models, but it's even more complicated than that. If you look at decisionmaking with data, people rarely get the perfect optimal choice. In fact, most everyday decisions are made without a full examination of all the available options. The best option may be missed. At Carnegie Mellon land, we call this [satisficing], the idea that you make a decision with the best possible constraints you have. Maybe you just don't have time to find the perfect one. Complicating this even further is the idea that we have physiological limitations on our data sensemaking process. So working memory is only seven-ish units in size. You cannot store a huge amount of data in your head as you're performing a task. We have limited attentional resources, so you can't focus on too many different targets as you're exploring data without getting overwhelmed and your performance degrading. And even things like feelings of self-efficacy, expecting that you're going to do well in a data sensemaking task, actually impacts your performance. If you feel like you're going to do a good job exploring data, you actually do. So to operationalize this a little bit, let's look at buying a house in Pittsburgh. If you're going to be buying a home in Pittsburgh, which is a data analysis task, and these numbers actually are reasonable for Pittsburgh, for those who are not necessarily believing me -- it's amazing. So you often come into a data task like this with some existing expectations or understanding. In this case, I have bedrooms, baths and budgets, but as you look at some more data, you realize there are some criteria you didn't expect to see but really do care about. 
So in this case, maybe you realize parking is an issue in Pittsburgh, and you really care about a nice neighborhood. However, as you keep exploring, you find, you know what, a nice neighborhood is actually pretty expensive. I want to live in a nice place, but it's going to cost me, so I have to re-adjust some of my existing criteria to match. And as you keep exploring, you end up accumulating a lot of different criteria, which speak to a really deep understanding of the data. But as you can imagine, trying to find something that matches all of these is a really hard process. So this idea is really characterized by exploration. You don't just find the point you're looking for. You build up a model as you explore by seeing a bunch of contextual examples. There's this idea of hypothesis testing. You think to yourself, all right, well what about a nice neighborhood? What would that look like? Well, let me experiment and see what that might be. It's an active, iterative process, as I mentioned earlier. So switching gears, let's look at how it goes with an existing interface that consumers may actually use. So this is Zillow's visual tool for identifying houses. You can see each red dot is a house in Pittsburgh. And I can actually pretty readily, using their faceted browsing tools, pick out some criteria, and I can encode these directed criteria really easily. The houses will disappear. The ones that still match my criteria will stay there, but if I want to ask some questions of the data, like what if I want more bathrooms, what if I want a bigger place or a smaller place, I have to go through a lot of different interface steps and then see what's appearing and disappearing in order to gain any understanding. This process has a disconnect with what people are actually doing. For known goals, that interface is really good, but for exploration, it gives you really hard feedback. Either points are there or they're not. You don't know why they disappear or why they appear, based on your filters. Interactions to explore and test different values involve a lot of different steps. And so what I point to in my work are ways that we can surface context in a really fluid, natural way that's relevant to the user so that we can accommodate exploration and deeper decisionmaking, and I'm going to do that in this talk in three different domains, where I focus my work. So to begin, let's look at Wikipedia. I mentioned earlier that Wikipedia pages are sort of like an iceberg, and here's an example of one page. I've actually contributed to this, though you may not be able to tell just from looking at it. And if I asked you, who's contributed to this? Are there any viewpoints that are particularly strong on this, what are the cultural backgrounds or the genders of the people who've contributed to this page, were there any debates going into it? You wouldn't be able to tell, just by looking at its current state. Wikipedia is an immensely collaborative artifact, but the collaboration is hidden from most everyday users. To get at some of this collaboration -- because maybe you're going to make a contribution, and to make a successful contribution, you need to know what's already been tried, what issues may be present that you can't necessarily see immediately -- one thing that we can do is go to the little button in the upper-left corner, Talk, which is discussion among editors, and for a big article, this actually poses a serious barrier.
So here's a discussion among editors for Scientology, which is a controversial article. This here is several dozen pages long, and you'll notice at the top, archives. There are 30 different archives, each potentially dozens of pages long, all containing discussion among editors. So if I asked you, what are people talking about in Scientology? What should we do or not do based on discussions in the past? You wouldn't really be able to say that much. Maybe you could use the search box to search for the word "cult," because you think maybe there's a debate about that in the past, but it's not necessarily clear what you should glean from this data store. Instead, maybe let's look at the past revisions, the past things that have changed in the page in actuality. However, for an article like the article Abortion, just the diffs of changes authors have made over time are roughly 20 copies' worth of Pride and Prejudice in length. So we can't exactly expect you to dig into this content, either. So you can see the sort of difference between exploration and directed. You could search the discussion pages, you could search the revisions, for a specific term, but if you wanted to gain a general understanding, you really have no ability to parse through this data. It gets even worse. So we conducted interviews with three expert Wikipedians in the Pittsburgh area, and one of the things they pointed to immediately was this idea of conflict. People are fighting on Wikipedia. They have zealotry about certain topics, and for newcomers especially, this could become a serious issue. If you wade into a conflict zone unexpectedly, your work is going to be thrown away, perhaps in a hostile manner, and you'll never come back. One of our experts, in fact, has received sort of a Wikipedia version of a no-contact order, because after a battle in one particular Wikipedia article, they were stalked by another editor. The more interesting part, coming out of these interviews, is about information overload, so we asked all three of these Wikipedians, what do you need to do to make a successful contribution? And they all said to us, well, once you have a region of the page you're interested in, you want to check the discussion and check past edits to see what's happened, see what sorts of discussions are happening there, who the stakeholders are, if there's any conflict. We then asked them, go ahead and make an edit for us, and we gave them a couple of editing tasks. They did not use any of those resources. Immediately after telling us history is important, they did not use historical resources at all. We asked them, why was there this sort of disconnect? Why didn't they use them? And they said, there's just too much. I'm never going to be able to find out anything in a tractable amount of time, so I'm just going to try it out and see and hope for the best. So we have an opportunity to do better here, but if you look at existing interfaces, they still sit kind of at the high level, rather than digging into the actual substance of the conversations and activity, so Wiki Dashboard in the upper left shows you kind of temporal relationships of different authors, who's contributing right now and how often.
History Flow right here shows you the evolution of the page in graphical form, but it's hard to get down to the level of what changes people are actually making and why, and in the lower-left corner, you see Snuggle here, which is a tool for administrators to socialize and interact with new editors, situated within the contributions these new editors have recently made. But Snuggle actually was appropriated as a tool to target new users for being kind of too engaged and kind of too inexperienced, so a lot of the administrators who saw these new edits in fact were throwing them away, saying, nope, this isn't ready yet. You need to do more. Because they didn't necessarily understand what the edits were actually doing in context. So with this problem kind of in mind, we took all the discussions on a particular Wikipedia article and used topic modeling to situate them within a small section of the article. So the idea is, as you're browsing Wikipedia now with this Discussion Lens tool, within each section, you can start to see important discussions for that section that led to its evolution. So for example, if you're on a particularly contentious article, which is the article on hummus, the food -- it turns out it's a very conflicted article on Wikipedia. When you're browsing the section on etymology, it's important to know about this discussion right here. These are people debating whether the Oxford English Dictionary is an authoritative source for the origin of the word hummus. Is it a Turkish word? Is it an Arabic word? It turns out that if you assert that it's a Turkish word, these authors are going to strike you down, and we see that throughout the history of this article. So now, if you are reading the article on hummus, you can imagine seeing that as one of the recommended related discussions. You'll be aware that this is an issue that you may not want to wade into. So we're cutting down the complexity here by surfacing relevant contextual discussions within a much smaller section of the page. And in practice, this does make a difference, so we asked participants in a between-subjects lab study to use either the Wikipedia interface or our interface, in blue, to write a guide to an article section. And the idea was, if you were talking to your friend and telling them what they should contribute to this new section, what are some openings, who are the stakeholders? Are there some issues you should avoid or consider? For a small article, like the article on hummus, our tool really isn't reducing the complexity any. People are just as able to write decent-quality guides based on history, but for a large article, like the article on Alan Turing, the tool, by crunching down and providing just contextually relevant information in a given section, actually does provide users a much better picture of historical data. This is on the reader's side. Just now, for CSCW, I'm working on patching into this what the editor side looks like. So based on this feedback, do you actually make a better contribution to Wikipedia, knowing this extra information? This is the question we're looking at right now. So I talked about discussion, but what can we learn from past contributions themselves? So I've also done some work modeling past contributions to Wikipedia, and a lot of times if you look at it, this is a stream of different contributions to the article on Scientology, you'll see comments like this -- undid revision by person.
The idea is a lot of times in Wikipedia, work is just thrown away wholesale, either because it violates norms as part of a conflict or isn't wanted by a particular community surrounding an article. And so we constructed n-gram models to try and get an understanding of what sorts of content are valued or not valued by editors, using machine learning to cut through the complexity of this large historical store. So for a simple change like this, changing "jumps over" to "walks near," we can construct a feature vector that captures the changes they made and considers whether or not that edit was accepted by the community or rejected -- in other words, reverted, in Wikipedia parlance. And if we do this over 150 different articles, it turns out that we can actually pretty accurately predict whether or not contributions are likely to be thrown away by the community, just by the words that they're choosing to change. So the idea here is not that we should just tell a person, nope, yours is going to be thrown out, yours is. Rather, that we can use this model to gain an understanding of what things are particularly risky to do to a Wikipedia article. So these are model weights for the article on genetic engineering, and so you can see at the top, "dude" is definitely something that you should not contribute to the Wikipedia article on genetic engineering. Surprise. But maybe more surprising is "shouldn't." It turns out "shouldn't" is a prescriptive term, and Wikipedia's neutral tone does not allow that sort of language by policy. The header you see there, genomique engineer -- there was a debate in the article about whether this header should be included, and so our model picked up on the fact that that was a conflicted area of contribution, whereas of course Monsanto is a less risky term. Interestingly, "exceedingly" and "involves," depending upon context, could either violate Wikipedia policy or not. So the idea here is that this model is capturing some really interesting features within the textual data store. And so the future I think is a really interesting possibility. In the first line of work, I was looking at how to collapse different discussions down for editors while they edited in a particular article context, and you can imagine making this into a sort of recommender system. If we know you're editing a particular section, and we also know the changes you're making, how do we present relevant data to you that's actually going to change the kind of contribution you make? How can we be prescriptive and say to people, we noticed you're adding Turkish to this particular section. That's no good. Here are some examples for why, and here's some discussion that's relevant -- to actually improve their quality of work. Similarly, we can use history to direct people to new interesting areas to contribute. If we notice that there aren't many recommendations for a certain section, there isn't much activity, maybe that's an easy entry point for legitimate peripheral participation as newcomers socialize. And perhaps most interesting to me right now is being able to construct FAQs and guides dynamically based on historical data. So if we know what's going on in the history of an article's section, can we construct a guide for that section just from the kinds of comments people are making as they change work in that section and the kinds of discussions that are happening? Go ahead, Mary.
>>: So I'm just wondering, so I like these ideas, but I also see how, particularly let's say with Wikipedia, where there is a certain pervasive culture among the editors, which is very exclusionary and presents certain points of view, and these tools and ideas you have might help someone to fit into that culture and better operate within it, but doesn't maybe address the larger question of how to effect change beyond that. Do you have any thoughts on that? >> Jeff Rzeszotarski: And so this is actually really where I think we can effect change. So this first one does run the risk of only enhancing the orthodoxy of Wikipedia, because we're telling people, avoid this, it's dangerous. It doesn't change the dynamic at all. This work draws on work on Wikipedia socialization, which generally says that newcomers tend to go to Wikipedia articles, make one contribution, get it thrown away and never come back. The barrier for Wikipedians is in fact these early edits. And once they've socialized a bit more, they can handle wading into riskier areas and taking stronger viewpoints. And so one core possibility for investigating historical data is to provide better entry points. So these may be lower-risk entry points, in terms of an article section that hasn't been contributed to that much, but this gives people hooks to begin effecting change, and can help bring in more diverse audiences. In particular right now, Wikipedia is suffering from a gender and culture problem. It's predominantly white males who are contributing. And if we can provide better entry points that are less hostile for a variety of different contributors, maybe we can start to change the culture in that manner. Yes? >>: I was going to ask, it's sort of a related question. So yes, the statistics are pulling out a space that reflects the culture or the people that contributed. >> Jeff Rzeszotarski: And it's all temporal, of course, too. >>: Yes, so my question is sort of specifically about the system that did the topic modeling. Is there a way to weight the importance of the topics? Do you bake that in? >> Jeff Rzeszotarski: Yes, I've got it here. So there's a star in the upper-right corner or something. We're playing with this now. What we prototyped out to finish it up for CSCW is collaborative filtering. So the idea is that if people actually find this information valuable, we can also use that to weight it, and you can imagine constructing more meta information out of this, so maybe as people read discussions or as people close discussions, we can construct summary information that's more condensed and more relevant. I didn't -- yes. >>: So once you're established, then people can go and rate. >> Jeff Rzeszotarski: Kind of in a post hoc way. Also, I haven't really even talked about temporal issues, which are also an interesting question. Do things decay over time or stay consistent? And scale -- Wikipedia pages, singularly, like I've been discussing now, or all of Wikipedia as one model? So this is one particular final vision. This is a prototype we're thinking about right now. You can see we're giving people real-time editor feedback in the left bar unobtrusively. They get more information about what they may or may not want to add. So that's thinking about context in Wikipedia as the historical context hidden beneath the page. How can we expose that to people in a tractable way so that they can make sense of the data? I'm using ML and Vis to help get us to that point.
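To make the revert-prediction idea from a few minutes ago concrete, here is a minimal sketch of training a classifier on n-gram diff features to predict whether an edit gets reverted. The toy revisions, the feature names, and the choice of scikit-learn are illustrative assumptions, not the actual pipeline from the talk.

```python
# Sketch: predict whether a Wikipedia edit gets reverted from the words it
# adds/removes. Toy data and feature names are hypothetical stand-ins.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def edit_features(before, after):
    """Bag-of-words diff features: which unigrams the edit adds or removes."""
    old, new = set(before.lower().split()), set(after.lower().split())
    feats = {f"added:{w}": 1 for w in new - old}
    feats.update({f"removed:{w}": 1 for w in old - new})
    feats["delta_len"] = len(after) - len(before)
    return feats

# (before_text, after_text, was_reverted) -- stand-ins for real revision history
edits = [
    ("genetic engineering is used widely", "genetic engineering is used widely dude", 1),
    ("crops are modified", "crops shouldn't be modified", 1),
    ("the process involves recombinant dna", "the process involves recombinant dna techniques", 0),
    ("monsanto develops seeds", "monsanto develops and sells seeds", 0),
]

vec = DictVectorizer()
X = vec.fit_transform(edit_features(b, a) for b, a, _ in edits)
y = [reverted for _, _, reverted in edits]

clf = LogisticRegression().fit(X, y)

# Inspect which terms the model treats as risky vs. safe to add, analogous to
# the per-article model weights described for the genetic engineering article.
weights = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]), key=lambda t: -t[1])
for name, w in weights[:5]:
    print(f"{name:30s} {w:+.2f}")
```

With real revision histories, the same weight inspection is what would surface "risky" terms for a given article, rather than the model being used to reject contributions outright.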
Switching gears to crowdsourcing, I think everyone in this room is pretty familiar with existing crowdsourcing marketplaces, including Mechanical Turk, which is more micro, and Upwork, which is more contractor-organized. In short, crowdsourcing workflows kind of follow this pattern. Imagine I've got a big corpus of images -- in this case, adorable puppies -- and I want to tag each of these. I could one by one go through each animal and tag it, or I could give each picture to a single person in parallel and have them all do it. This holds a really interesting possibility for getting a lot of human judgments really quickly and scalably. The challenge is, not all results that you get are good, especially when economic motivations, like in the case of Mechanical Turk, start to come into play. When people are extrinsically motivated, they may try and find ways to game the system. So I asked people to tag that image, and you can imagine getting really eager answers, or answers like these -- we asked them for three to five tags, and they gave us three, or no tags at all, hoping we wouldn't notice that they didn't give us any work. This is good for them, because these people are then making the highest possible hourly wage for the least amount of effort and hoping we won't notice. Some Mechanical Turk workers call this cheating, the idea that they're cheating you out of particular value by not contributing. And if you look at Michael Bernstein back in 2010, the Find-Fix-Verify paper, they pegged it at 30% of submissions being of such low quality that you can't even use them. These days, 10% to 30% is about the rule you should use. So you think, then, we've got to figure out which one -- based on each submission you got, is it this, this or this. And if you look at the existing Mechanical Turk interface, this is what you find. I asked them to help name my company, so each row here is a list of company ideas brainstormed by a Mechanical Turker. You notice I can get the names they've given me, so the raw data they gave, their approval rate -- in other words, have I kind of reputation-system-wise approved them before -- and if I dig really deep into this interface, I can also see whether they worked for a long time or a short time, but that number is known to be incredibly unreliable. So how do you find the good work? Existing research has looked at this through two different lenses. One is design better tasks, so people have to give you good work, which is really hard. This usually involves a lot of iteration and a lot of incentive design. You can also in a post hoc way analyze what you've got and try and find the good stuff. So one way to do that is by seeding your task with gold-standard questions. So the idea is if you already know the answer to some questions, you can put them into your tasks and see which workers get all of those right, because then obviously you trust their results more. If I asked you, though, what's the gold-standard example for a poem, or what sort of restaurant review would tell you whether they're a good or bad worker -- for complex work, this all kind of falls apart. It's hard to understand more creative or more varied inputs. Also, I might add that workers are known to game this. CrowdFlower in particular has been known to have pools of workers who learn the gold-standard formula and only answer those properly. You could also have multiple workers redundantly do the same task. For instance, if you're transcribing a video, you can just pick the most common answer or most common substrings to get a decent transcription.
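As a concrete illustration of the redundancy approach just mentioned, here is a minimal majority-vote aggregator over repeated labels for the same item. The item names, labels, and agreement threshold are made-up stand-ins, not any particular platform's API.

```python
# Sketch: aggregate redundant crowd labels by majority vote.
# Items, labels, and the 0.6 agreement threshold are made up for illustration.
from collections import Counter

def majority_vote(labels):
    """Return (winning_label, agreement_fraction) for one item's labels."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

responses = {
    "image_17": ["dog", "dog", "puppy", "dog", "dog"],
    "image_18": ["cat", "dog", "cat", "cat", "bird"],
}

for item, labels in responses.items():
    label, agreement = majority_vote(labels)
    flag = "  <- low agreement, maybe route for review" if agreement < 0.6 else ""
    print(f"{item}: {label} ({agreement:.0%} agreement){flag}")
```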
Of course, there's no most common short story, and having six people do the task of one person adds a lot of redundancy into your system and can crunch down a lot of the diversity that human judgment is really valued for. So in this line of work, I propose a really different signal for evaluating the quality of work in a crowdsourcing workflow, and even just understanding in general how a crowdsourcing workflow is going. And that is thinking about the middle, between designing a task and getting your results. The way workers work can tell you a lot about not only their performance, but the task and workflow in general. And so to give you an example of what that looks like, let's consider two workers taking an ACT practice test. So you read a passage, reply to some multiple choice questions. Worker A accepts the task, scrolls down, clicks an answer, clicks an answer, is done. Worker B accepts the task, pauses here, scrolls down, pauses here, scrolls up, scrolls up, pauses here, clicks an answer, pauses, clicks an answer, is done. And I'd ask you, not even knowing what the answers to these multiple choice questions are, which worker did a better job? You'd probably say B. It's not a super-hard question, unlike these questions right here, which are pretty difficult. You see "brouhaha" here. That's pretty tough. We associate Worker B's behavior with diligence, right? Those delays were them checking the passage, and our knowledge of the task actually informs a lot about their end performance. So we constructed a model that measured workers' behavior while they worked, using clickstream data. So here's a worker typing, submitting "hello," really low-level events. We had a bunch of workers complete different tasks, and from those low-level event strings, we extracted a bunch of general features that were more comparable across workers and more quantitative, so things like how long they spent, how much did they pause to think while they were typing, those sorts of behavioral features. We had workers do three different kinds of tasks: pick the nouns, tag an image, or that practice test you saw earlier. And in practice, just looking at the way people work can really inform -- give us information about their end product. So calling out image tagging, we had two raters rate whether they thought the person was cheating us, or in other words, giving us bad results intentionally, and our model just looking at behavior got 93% accuracy in terms of whether a person was cheating or not, based solely on behavior. And we had two raters also rate quality on a five-point Likert scale for those tags. Our model just looking at behavior can get within about 0.5 on a five-point Likert scale of human ratings, just looking at behavior, not even considering the end tags. So this is really interesting. This gives us the idea that behavior is a really valuable signal for understanding quality of work. But it also neglects to consider a lot of really interesting features. So right now, we're crunching behavior down into just a simple outcome measure, pass/fail or rating. What about individual variability? What about different ways of working or different cognitive strategies? So building on this work, in the second paper, I started to build a visual metaphor for these sorts of traces of activity. So here, the blue tick marks are people clicking on something. The orange lines are people scrolling up and down the page. The red boxes are people typing.
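Here is a minimal sketch of the kind of featurization described a moment ago: turning a worker's low-level event log into comparable behavioral features like time on task, scroll and click counts, and pauses. The event format, feature names, and the two-second "thinking" threshold are hypothetical.

```python
# Sketch: derive behavioral features from a worker's clickstream.
# Each event is (timestamp_seconds, event_type); the logs below are made up.
def behavioral_features(events):
    events = sorted(events)
    times = [t for t, _ in events]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "total_time": times[-1] - times[0] if len(times) > 1 else 0.0,
        "num_keypresses": sum(1 for _, e in events if e == "keypress"),
        "num_scrolls": sum(1 for _, e in events if e == "scroll"),
        "num_clicks": sum(1 for _, e in events if e == "click"),
        "longest_pause": max(gaps, default=0.0),
        "think_time": sum(g for g in gaps if g > 2.0),  # pauses over 2s ~ "thinking"
    }

worker_a = [(0.0, "click"), (1.1, "scroll"), (2.0, "click"), (2.4, "click")]
worker_b = [(0.0, "click"), (4.5, "scroll"), (12.0, "scroll"), (19.7, "keypress"),
            (26.0, "click"), (33.2, "click")]

for name, log in [("A", worker_a), ("B", worker_b)]:
    print(name, behavioral_features(log))
```

Features like these are what a downstream model or timeline visualization would consume, rather than the raw event stream itself.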
You can see this person paused in the middle of typing, and black lines, which you won't see here, are changing focus or tabbing to a new tab. The idea here is now we can actually go a bit deeper than just good or bad, so I'm going to ask, who did the poor work, and now, why? You probably would say A -- these are people doing the ACT practice question -- but A didn't do the poor work because they had a shorter trace. In fact, time spent on task is usually a poor indicator of performance. Instead, what you see in B and C is this upside-down V pattern. These are people who are reading the question and then checking the passage for the appropriate answer, then scrolling back down again to pick their answer. We're now actually getting to their behavior, in other words. This visual metaphor lets us understand not only whether they did good work but also start to explain theoretically why that may be. We can find out some really interesting things by looking at worker behavior. We asked them to tell us their favorite color and use a color picker to pick it and then describe it to us. There's a perfect, or a near-perfect, correlation between the delay they spent before they told us the color name and the length of the color name they gave us, because they were picking the perfect shade in the color picker. More operationally, we asked people to translate from Japanese to English a particular passage, and one of these is not like the others. Only one of these has red blocks indicating typing, so only one of our workers, in what I believe was a 10 or 20-worker pool, actually did any typing while they were completing this translation task. Everyone else used a translation service, and so the most common answer to this translation passage is this one, which if you can read Japanese does not match all that closely, and if you can read English, you can tell is not terribly sensical. This is the worker in green, the one who stood out in the behavioral traces. It turns out, they actually used machine translation, as well, but they took effort to proofread and correct the machine translation before they gave it to us. So this still wasn't a perfect result, but we were able to pick the best result possible, which was not the most common one. These timelines were part of a much bigger interface that lets you triangulate down on the relationship between the output workers gave you and a number of different representations: their behavior, and some quantitative features like their time, how much thinking they were doing and things like that. The idea here is that you can pick a few interesting behavioral traces out and use distance and ranking measures, then use small ML algorithms to pick out more people like them, so you can iteratively build an understanding of different kinds of strategies or different kinds of working habits among your workers. Yes. >>: Question. This is [Bob Frankel]. What's your sense on how this would generalize? For example, if you provided an API and a schema, and people could just fire data at you from any domain and say, look, here are those different metrics operationalized by me on some task -- writing an email or doing whatever? What's your sense? Do you think it would work? >> Jeff Rzeszotarski: Yes, so A, we don't know, and this is something really interesting to me, is kind of getting beyond just performance to thinking about, are they ESL? Do they have domain expertise? Are they checking their email well or not? Things beyond just this pure labor pool performance.
In the initial ML work, we had really good results porting the model, so we could take a model from one place like notification and actually accurately predict whether a person would pass that reading comprehension test. So it points to maybe there being archetypes for different kinds of tasks, but a lot is left to be done. You can think of similar behavioral or interaction patterns between multiple classes of tasks. It's most certainly task-directed as well as application-directed, so there's probably some interaction of the two. I would really like to investigate further, through a lens beyond just pure quality. Additionally, you can imagine giving this right back to the contractors working, so can we tell contractors, we noticed you have this skill. Maybe you could find tasks that are more aligned with this that deliver more value. Or maybe we notice here you're getting fatigued. Why don't you try something fun? Here are some suggestions. So giving power back to the contractor through self-awareness. Similarly, we can actually give organizers a much better picture of what's happening, so in that ACT practice test, you can imagine telling organizers, hey, I noticed that the good people were checking their answers. Could you make your task design such that people check their answers by design? There's a really nice, fruitful cycle where you learn from the different strategies workers are employing on the fly as you develop better and better versions of tasks. So context in crowdsourcing markets really means discovering human behavior as people complete a task and surfacing that to task organizers, so that they can make a better adjudication about the performance or nature of their working pool. So once again, it's giving them the information necessary to understand and then act. In the last portion of the talk, I'm going to briefly touch upon a more general approach, which is helping people see more context in their own multivariate data. So multivariate data is something that I think we're all pretty familiar with or have a good amount of experience with. Each row here is a brand of cereal. Each column is nutrition information from the back of the box, and you can see, if I asked you, find the correlation between carbohydrates and sugar, you may intuitively know there is one, but going number by number may be too difficult, if you're just starting to look. So to get a better visual understanding of the multivariate data, researchers have gone in the direction of visualization techniques. So here's an early example of Film Finder, which charts out films on kind of a temporal and a rating axis. The neat innovation here is that you may not be able to see -- you may not want to see every single film on that chart. You may only have certain interest areas, and so we can use dynamic querying, these sliders over here, to filter down what you're looking for, and the stuff accordingly pops in or out of the screen. What if you want to see more than two dimensions at once without scrubbing that slider to see? We can instead stack charts, so now we have three different dimensions of data showing in these stacked charts, and we can use brushing to help zero in, because the attentional load for trying to find certain regions is quite high. If these weren't colored, it would be hard to figure out where the clusters lie, at least certainly the green and orange. So the attentional load is high.
We can also use really advanced visualization techniques like parallel coordinates, which are really effective at seeing trends where values change abruptly. So each point is a row now, and it crosses these vertical lines on its values, so you can see that the orange tend to go all down, the blue tend to go all up, but if you were untrained, this could be pretty overwhelming, especially if you weren't an experienced analyst. And if these weren't colored, would you necessarily be able to see it through all the noise? So there are some core issues I'm identifying in these sorts of approaches, which I'm pointing to as limitations -- these approaches certainly are incredibly valid and work well; I've used them. Hard constraints can make it hard to track values as they change over time, so as I move those sliders in Film Finder, stuff appears and disappears. Training can be a serious issue. And also, they can be high load. Those stacked plots are really hard to interpret at times. So interestingly, a really wonderful thing has come up in the past decade, where touch devices have become not only common but heavily used by everyday people. Everyone owns a tablet or smartphone in America these days, at least a high proportion, which is a shocker. And these devices have a really nice property. They bind interaction really closely with response. They occupy the physical space of a person, and they also afford a really interesting potential in terms of naturalistic visualization systems. So I'm defining these as systems which employ interactive or visual affordances that resemble real-world phenomena. And the idea is, we can use touch and these natural-feeling systems to get really close to users. We're leveraging their inherent expertise, so I know if I drop this, gravity will pull it to the table. And even further, this is a fluid thing, so it doesn't just pop from here to here. It actually fluidly transitions all the way down as part of its fall. These sorts of interfaces encourage a lot of experimentation and play. If you think about using a tablet, it's generally a playful experience. And so in this line of work, in the Kinetica project, I asked the question, what if we used physics-based approaches to help people explore multivariate data, leveraging this idea of fluidity and the low training cost, because people already know the models, and using physics metaphors as applied to actual data processing. So to give you an idea of what I mean by a physics metaphor, here's one. This is a kitchen sieve. This is actually a really great filter for data. So not only do you see the particles that pass through the filter, the small cornmeal, you also see the stuff that didn't make it, and you encode in the process of shaking this filter out the act of filtering. It has really nice properties in terms of amount on either end and the action. So you can see here in this video, I'm doing the same thing to data now. Some things pass through. Other things don't, and I see both ends of the filter. And I recall it and encode it because it's an action I take. We can use magnetism to pull points to parts of the chart, and we can emergently combine different physics-based tools together to get really complex data interactions. And so here we're filtering out some points, charting them and then highlighting some that match a criterion. To understand where and how these particular physics-based Vis approaches are good, we conducted a small between-subjects user study, comparing Excel to this new approach.
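To make the sieve metaphor concrete, here is a minimal sketch of a filter that, unlike a hard faceted query, returns both sides of the split, so an interface can keep showing what was filtered out alongside what passed. The car records and fields are made up for illustration.

```python
# Sketch: a "sieve" filter that keeps both partitions visible instead of
# discarding the non-matching points. Data and fields are made up.
def sieve(points, predicate):
    passed, held_back = [], []
    for p in points:
        (passed if predicate(p) else held_back).append(p)
    return passed, held_back

cars = [
    {"name": "hatchback_a", "mpg": 34, "price": 18000},
    {"name": "sedan_b", "mpg": 28, "price": 24000},
    {"name": "suv_c", "mpg": 21, "price": 38000},
]

in_budget, over_budget = sieve(cars, lambda c: c["price"] <= 25000)

# A physics-style UI would keep both groups on screen (e.g. piled on either
# side of the sieve) so the user sees how much the filter removed.
print("passed:", [c["name"] for c in in_budget])
print("held back:", [c["name"] for c in over_budget])
```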
Participants first received training in either case. We thought these conditions balanced out. Participants tended to have more Excel experience, but Excel is comparably harder to use, so we thought there was some leveling going on, because Kinetica's training cost was much lower. Once participants were trained, they were given some basic stats questions to make sure they understood the technology. Then, they completed two different tasks. The first was, here's some data, find the perfect car for you. Here's some example criteria to go on. The second task was a set of people who were on the Titanic when it crashed. We gave users an open-ended exploration task -- find out as much as you can in this data set. Here's an example of what one participant found in the car-buying task. We asked them, what are you looking for in a car, and they immediately pushed all of the points out that were hatchbacks. Apparently, they really did not like hatchbacks, but you can still see them here. They graphed by weight, because they had a hypothesis that heavier cars do better in the winter, and you can see they encoded this three-dimensional sort to capture the distribution of power versus fuel economy while filtering out based on their budget. Interestingly, the participant did not just go and say, you know, this is the optimal one. It has the best mileage, or that Porsche up there is the best one because it's the most powerful. They gauged the bulginess of this distribution and said, I really want something more like that, because that point's in the middle of the road on a lot of different features I care about. It spoke to their deeper understanding, as opposed to just optimally and perfectly matching the criteria that they started with. Here's an example of a participant in the Titanic condition, and they're actually doing a four-dimensional query here, so points are being pulled to a particular place in the chart. In this case, we have cabin class and gender of passenger. They're colored by survival, so the people who died are red, the people who lived are blue. They already noted that more women survived than men. I'm sorry, this is kind of a macabre example. But they were interested in, what about the children on the boat, and so they drew a barrier that excluded the children from the set, pushed them out, but because they still hold their place in the chart -- this is the kind of consistent physics metaphor -- we get clusters. So you can see in the lower-left corner, there's a solo red dot. It's the only girl in this data set to die, and similarly, there's a solo red dot among the women in first class. This is a mother and daughter that this person was able to find because they're outliers on this four-dimensional split. They wouldn't have otherwise noticed it. I might add, a lot of our participants -- this was kind of a more ecologically valid sample, so we did not have a lot of college students. A bunch of our participants had never even picked up a tablet before and were still able to do this sort of task very quickly. Looking at participants' findings, in general, Excel users excelled at these two types of findings. We coded them with two different raters with pretty high reliability. Point findings are things like, this particular data point is aged 40. Statistical findings: the average age is 50.1. Whereas Kinetica users were much more able to do comparisons -- more women survived than men, there's a relationship between age and survival.
Older people tended to die more often in the data set. And descriptive things, like there just seemed to be more men than women in this data set. This speaks to a more holistic and general understanding, while they could not necessarily get down to quantitative features. Going all the way back to the old Perer and Shneiderman paper I mentioned about broad exploration moving into statistical tools, you can imagine this being a sort of wayfinding, where you identify interesting areas to further interrogate using quantitative means. Since Kinetica, I've commercialized the technology as DataSquid, which has been really great, because it's allowed me access to data stakeholders, people who actually use data in their everyday lives, and I can go into more detail on this with you later. But this also led to a redesign of interactions, so this is what DataSquid looks like now. And you can see, we focused in on giving people a plot at all times, because we realized the core benefit this was providing to people was in terms of varying different representations. Context in the case of Kinetica/DataSquid means giving people as many different views of the same data as possible, so they can build a better model and notice more interesting trends, and doing this in a way that forces them to see statistical features like distribution. You can see how these bulge in different ways and have different centers without a box-and-whisker plot. In the future, I think there are some really interesting possibilities. How do we show a million points on a small screen in a way that's sensible to inexperienced users? And how do we represent the fact that if we're clustering 100,000 points together, that there's a stochastic quality, there's an uncertain quality to those points? There is no perfect average for those 100,000 points. How do we devote detail on the screen, devote more pixels to the parts that actually matter in high detail, and fewer to the parts where we know the user may not be interested? Additionally, I think presentation and sharing is really crucial in this sort of information visualization approach. All of our stakeholders for DataSquid want to share this with others immediately, and I think this has really interesting potential. How do we help people curate data presentations in a meaningful way while they explore or right after they explore? What sorts of things are people choosing to present? What aren't they presenting, and what's the best medium to share? Do we demand our users do this live in person and speak it aloud? Can we generate static visualizations that we pass on to other people after the fact? What modality works best for conveying information? And of course, what does physics look like for correlation? What sorts of complex analyses have metaphors in the easy-to-use physical world, and what do those look like? So context in this case means context about your own data. How do we give you detailed representations of a data set such that you can find interesting features and build an understanding to direct your decisionmaking? In particular, we find that this DataSquid tool is really good for Yelp data, helping people pick their favorite restaurants, because of this underconstrained, "I don't even know what I'm looking for" quality. One theme that's emerging out of all my work moving forward is this idea of use -- how do we move from seeing to employing or using and acting? So I propose this creates a really virtuous cycle.
If we work with stakeholders to understand data and develop new data visualization solutions, we can improve the kinds of contributions on Wikipedia, or the kinds of findings people make with data, which in turn gives us better data to generate new systems. There's a really virtuous cycle inherent in this process. In Wikipedia, you can imagine being prescriptive, telling people, here's an interesting area to contribute. Here are some things to consider if you want to contribute up there. If you're breaking norms, these are the people you should probably talk to before you do. We can do this in more communities, such as web forums. You can imagine capturing what a flame war is and modeling that. On open-source projects, we can capture, what makes a good issue request? Who is contributing to a certain part of the code, and what do you need to know in order to make a good contribution there? In crowdsourcing, you can imagine moving towards being prescriptive to task organizers. Here's how you should redesign your task. Here are some stakeholders in the crowdsourcing market that would be really good to talk to or have domain expertise that would be really well suited to your particular project. And we can expand this to contractor or creative-type markets. Identifying expertise becomes a really critical concern as the work gets bigger, and understanding the kind of work and the process of work becomes increasingly critical. In the multivariate models, as I mentioned before, scaling up and thinking about presentation is important, but also what does a physics-based visualization tool look like for graph data? How can we extend this sort of approach towards naturalism to new data types and keep people rigorous, so they avoid the problem of running so many t-tests that one is always true? How do we make sure they have an adequate understanding of statistical reliability, even if they're an experienced user? So with that, I'd like to thank you all very much for your time, and thanks all for hosting me. I'd love any questions. Yes, Erin. >>: So at the beginning, when you were talking, you had an example with Zillow as being hard to use for certain kinds of things. I was just wondering if you had had any thoughts on that in particular in terms of what the solution might be for that kind of -- >> Jeff Rzeszotarski: Yes, so we're actually running a between-subjects study right now, evaluating kind of a Zillow-type home-buying task between a traditional interface like that, something more involved like Tableau, and the Kinetica prototype. The idea here is that if you don't necessarily know what features you're looking for, because DataSquid/Kinetica shows you a bunch of different representations really easily, you can help triangulate on breakpoints in the data. Maybe neighborhood actually is a feature I care about, because it really cleanly breaks the data into want and don't want. Another area we're looking at with the Zillow task is annotation, right? If you're doing a lot of different analysis steps and different representations, giving you an ability to carry through information from each of those different representations is important. So maybe I tag things that are cheap and in a nice neighborhood with a red color and then go through and say, well, you know, these are decent parking areas on the geographic view, tagged as the blue color, and at the end, be able to collapse your information down.
I think in general, the only way we're actually going to be able to zero in on what makes a proper kind of Zillow or customer decisionmaking tool work well is by running a bunch of A/B and exploratory studies like what I'm doing now, trying to get at, piece by piece, part of the data by part of the data, where is the benefit coming from? And even in Kinetica, interaction by interaction, why is the tool performing better than existing ones? Yes. >>: You mentioned a little bit at the end there about scaling up the size of the data. Are there limits in your mind, and if so, what are they, do you think, for both the sort of physical interaction analog, like in Kinetica, but also the cognitive analog, like in the first stuff, where you're saying we can pull out topics and things that map to people's mental models. But when you think about millions or billions of data points -- where does it break down? >> Jeff Rzeszotarski: So something that's kind of lurking behind the scenes in the Kinetica slide I showed is that data are often not perfect or even good quality. Data are noisy. Data come from varied sources and need to be brought together, and that problem only magnifies the larger in size you go. So I think one way to start tackling that problem is to think about machine learning and aggregation. How do we aggregate points into collections in a way that makes sense? One early prototype I did while ideating about this for Kinetica was to apply a hierarchical clustering based on data features to the points, so the idea is that we can bring all the points together or we can selectively break them apart if we know a user is interested in a certain set of points. The challenge in all these approaches, I think, is maintaining focus and context. How do we show the user a lot of detail about what they care about but still represent the other stuff in a faithful way that leads them to contextualize what they focus on? And that's where I think we may hit limits. So if you have a million points on the screen, the focal area may only be 10 points, and there may be 999,000 points being condensed in some way. We may not be able to condense those in a meaningful way, especially if the data are noisy. And so another way to take the work is to also think about other modalities, like natural language querying. How do we help people interrogate the data not just through this digital medium but also through querying it properly in their own language, reciting results back when it makes more sense to be quantitative than visual, mixed-media systems, almost. I wish I had an immediate answer, but a lot of my work in this sort of problem space has been around conducting design ideations, building prototypes and testing to explore these issues, because I've found that I've learned the most by just constructing systems that start to do this, that raise those salient issues to the top, learning by building. Yes. >>: Sort of a follow-up to your last point, you focused a lot on exploratory tasks, but how do you think about balancing a broad range of tasks, so on a real estate site, it may be that what people want to do is monitor the price of a property or look in one particular neighborhood. >> Jeff Rzeszotarski: Yes. >>: And in search, people do a lot of very simple things, so how do you balance that with the broader complexity, and how do you get people to go smoothly from one to the other? >> Jeff Rzeszotarski: Yes, and this is even true in Wikipedia, where an administrator wants to track changes over time.
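A minimal sketch of the hierarchical-clustering idea mentioned in the scaling answer above: cluster points by their data features, then cut the tree coarsely for an overview and more finely where the user zooms in. SciPy is an assumed choice here, and the two-blob data is synthetic.

```python
# Sketch: aggregate many points into collapsible clusters and expand on demand.
# Uses SciPy's hierarchical clustering; the random 2-D data is synthetic.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(500, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(500, 2)),
])

tree = linkage(points, method="ward")

# Coarse view: show a handful of aggregate blobs for the overview...
coarse = fcluster(tree, t=2, criterion="maxclust")
# ...then selectively break apart the region the user is interested in.
fine = fcluster(tree, t=8, criterion="maxclust")

print("coarse cluster sizes:", np.bincount(coarse)[1:])
print("fine cluster sizes:  ", np.bincount(fine)[1:])
```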
So yes, I think the talk is focused a lot on exploration. Those are kind of the interfaces that I've largely focused on. And the challenge comes, I think, in changing context, like you were saying. We can develop a prototype that really adequately helps people do directed search or keep track of prices over time, but it's the transitional moment that's the really hard part. So I've actually been prototyping time series data in the Kinetica line of work, and so how do you transition from looking at one single window of time to data mapped over time? And it turns out the transitional point is really, really, really hard to get right, because we want people to see the consistency, right? You were looking at one window. Now, here's that window in a larger context. And doing that with the kind of fluidity that users expect in this type of interface has proven extraordinarily difficult. Have the points moved based on their time such that they're consistent? Do we show kind of multiple representations repeated? In some ways, the answers are domain specific, so I think the tools that I would use to support a Wikipedia administrator aren't the same as what I would use here to show data over time in Kinetica. But that issue of consistency and learning what kind of tasks they're doing, whether it's actively or passively, I think are the core questions. Whether you have the user declare that or we try and infer it from their pattern of use, using the crowdsourcing work that I've done already on behavioral monitoring, that's a super-open question, really exciting. Yes, Erin. >>: [Indiscernible] like Kinetica and DataSquid, what do you feel is the capacity of the typical adult who's not a PhD in computer science to understand -- to have the right skills to understand how to interact with those, and do you feel that there's any need for more research or tools or curricula on how to educate people to interact with data? >> Jeff Rzeszotarski: I think this is a tension in the Kinetica work, where we want to give this to very inexperienced users because we know they can begin to use it quickly. But the question is whether they can use it rigorously or not, and how much statistical education they need before they're appropriately prepared to make findings using it. And that's a tough one. On one hand, one of the design principles behind Kinetica is trying to make the affordances we use for exploring data push you in the direction of statistical validity. So, surfacing the distribution at every step of the way through the way the points bunch up, showing filtering visually so you know how much is being filtered out, so you don't focus on two points when there are a couple hundred that you're excluding now. I think there are some visual and interaction ways to keep people rigorous, but that's still not enough. We can get a 45-year-old participant who has never touched a tablet before using this in five minutes, and they can pick out a house for themselves, but is it ethical to have them look at something for five minutes and find data points that correspond to such a big decision, and not necessarily understand the ramifications, because visualization can read as so authoritative? Jaime and I were talking earlier about the difference between text and visual, and how people are pretty well primed at this point to deal with text resources and ranked search results, and know that the top may not actually be the top.
That's not necessarily true when everyday people interact with visual systems like this, and so part of it is kind of setting expectations lower, saying, yes, this is one way to look at the data, but this may not be the full way to do it. And I'm not really sure yet what that looks like. Do we design more systematic things to say, hey, wait a minute, you've got to check these things before we actually go through with any decision? Do we just adjust the interface so it won't show you things if it's uncertain? It's a whole continuum that I'm not quite sure about. This is a problem even in Wikipedia, where people may not be able to interpret the syntax that people use in discussions when they're negotiating over the page. If they reference WP:Peacock, do you know what that is? It's actually a guideline that says don't use exaggerated, promotional words, but how do we make sure that we level these things properly? >>: [Indiscernible] the points to stay on the screen even if -- >> Jeff Rzeszotarski: I see money there, but that is real -- >>: I said get that thing out of there. >> Jeff Rzeszotarski: But I think that's the sort of tension this work sits in. And I don't have an easy answer for it, because I think it's a really hard design question as well as a systematic one. Yes. >>: Do you know, are there any good data -- directed data exploration tools that are geared towards kids, for example, as a -- >> Jeff Rzeszotarski: That's an area that I'm not super familiar with. There undoubtedly are, and I really need to look into that to understand the issues I think you're talking about, because I would assume that's the place where you start. >>: Here, that was interesting, too, because -- so I know one of the things people often say about Wikipedia is like, oh, kids shouldn't -- a lot of schools have rules like, kids shouldn't cite Wikipedia as a source in their reports for school, because it's not authoritative or something, right? And in Danah Boyd's book that she wrote last year about teens' use of social media, even though it wasn't really germane to the main point of the book, one of the chapters had an aside about Wikipedia that I actually thought was one of the best explanations I've ever read for why teachers should let students cite Wikipedia as a source in a paper. And it was focused on the fact that all of the background parts of Wikipedia that people don't normally read are actually really educationally informative for helping students, young kids like teens, understand the nuances and subtleties of the credibility of information, how it's generated over time, and what is and isn't controversial, and that incorporating all of Wikipedia and not just the surface features could actually be really important, educationally. And I wonder, seeing your system makes me think about -- your system is designed for adults who are contributing to Wikipedia, but I wonder if you know of or have any thoughts about tools that would actually help middle or high school students who are consumers of Wikipedia to be more informed consumers of some of this background content, in a way that would enhance their education. >> Jeff Rzeszotarski: Training their internal filter better, if nothing else. I'm not familiar with any systems like that in the Wikipedia context, but that's a really interesting angle to take this in. And then the question becomes how do you surface the right -- because I think the contextual information you want is then a little bit different.
You may want to bias it towards successful interactions, as opposed to people fighting and coming to no resolution without making any other progress, or niggling over a very tiny detail rather than an architecturally important part of the page. This is something I really had not gotten into much in the talk, but curation underlies a lot of this work, in the sense of what information you choose to present and why, because inherent in constructing these topic models are features that raise or lower certain parts of the discussion versus certain parts of the page, and inherent in the crowdsourcing work is which behavioral features get surfaced and why. And that really influences the end conclusions people make, and I think that's dictated a lot by who the perceived audience may be. It's an area I haven't explored much theoretically, and it sounds really, really interesting to dig into. Awesome. Thank you, and thanks, digital people.
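(Editor's note: a minimal, hypothetical sketch of the curation bias described above. The thread fields and scoring weights are assumptions for illustration only, not features of the actual system; the point is that ranking the discussion threads a tool surfaces can combine topical relevance with a signal for successful, resolved interactions.)

```python
# Rank discussion threads so resolved, constructive exchanges surface first
# and flame-war-like revert chains are damped.
threads = [
    {"id": 1, "topic_relevance": 0.9, "resolved": False, "num_reverts": 6},
    {"id": 2, "topic_relevance": 0.7, "resolved": True,  "num_reverts": 0},
    {"id": 3, "topic_relevance": 0.4, "resolved": True,  "num_reverts": 1},
]

def curation_score(t, w_relevance=1.0, w_resolved=0.5, w_revert=0.1):
    # Raise resolved discussions, penalize repeated reverts.
    return (w_relevance * t["topic_relevance"]
            + (w_resolved if t["resolved"] else 0.0)
            - w_revert * t["num_reverts"])

for t in sorted(threads, key=curation_score, reverse=True):
    print(t["id"], round(curation_score(t), 2))
```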