>> Andres Monroy-Hernandez: Welcome everyone. Today we have Brian Keegan. He's a data scientist, or a network scientist. He's a post-doctoral researcher at Northeastern University with David Lazer, and he got his PhD at Northwestern University. He's going to talk about narrating with data today.

>> Brian Keegan: Thank you, Andres. Thank you all for being here. I'm excited to be here and talk about some of my work. This is actually the first time I'm giving this version of the talk, so hopefully you guys will not begrudge me when you see another version of this, probably five years from now, as I refine it. This is the outline of my talk, and we'll come back to this slide more and more, but I want to give you a little bit of context on where I come from in network science, and in particular the type of network science that I do. I'll go into the event log data that I like to work with and how I turn these into sociotechnical trajectories. Then I'll go through three different cases at differing levels of abstraction, from very fine to very macro levels, of what we can do with this sociotechnical trajectory approach, and then we'll wrap up with the discussion. I differentiated on that first slide between network science and network theory. When we think of network science we usually think of someone like Professor Barabási here, who has been very instrumental in thinking about how we model networks and develop statistical models of networks, and they are very much concerned with examining the similarities across different kinds of networks. If we look at biological or social or information networks we see very similar kinds of patterns in the topology and the structure of these networks.
They're scale-free and they have skewed degree distributions like that bottom image there. But the models they often use, coming from the physics side of things, usually assume that people are protons or proteins or particles. They don't really differentiate between what it is that we are modeling, the interaction that we're modeling here. But they're very interesting and powerful models, because you specify these microscopic models, like how a single node behaves and how it makes individual kinds of choices, and you're able to generate these very complex macroscopic processes. But this whole domain of network science really ignores a lot of the economic and political implications of what these network features mean. What does it mean for our society or for our legal system or for how we allocate resources to have something that is skewed in these sorts of ways? They come from a perspective of, this is what it looks like, and let's try to explain that. On the other side we have someone like Professor Benkler here, who also talks about networks, but in a very different way: how do new information technologies enable us to develop new forms of organization, new ways of knowledge production? He's very much focused on the economic, legal and political implications of having a networked political economy. What does it mean to have a Wikipedia or a Linux or something where people are able to collaborate in large ways to create new forms of knowledge and new sorts of artifacts? I don't know if I'm allowed to show the Linux penguin here at Microsoft. [laughter].
He is also very concerned with not these microscopic questions about how nodes make links to each other, but these macroscopic concerns, like what does it mean to have a polarized blogosphere, like this image from Lada Adamic's work back in 2005. At the same time, when he talks about networks he doesn't talk about them in the way that Barabási would, where we're looking at the particular kinds of structures and how these networks are similar or different. So Benkler ignores the question of what network features generate a particular Wikipedia or Linux. Are Wikipedia and Linux similar? Are the ways that people collaborate in them similar? What does it mean for them to be similar or dissimilar? Beyond the need to reconcile the fact that we've got social scientists and physicists doing these different sorts of things with different methods, they really don't speak to each other often. They both talk about networks, but they talk about networks in very different ways. Benkler is really silent on what these differences mean for collaborators. How do they form a cohesion? How do they produce this knowledge? What do the structures look like? How are they different? Barabási is really quiet on the implications. What are the outcomes of these network processes? What is it that these networks produce? What I'm going to talk about here is how we can understand the structural characteristics of self-organization in a variety of ways, and how we can develop some methods to compare how we observe self-organization in these large-scale sociotechnical systems, in particular this idea of stigmergy.
This is sort of a theory from biology where sort of there's actors in an environment and they don't sort of -- they self organize and they coordinate their work, not by sort of directly communicating with each other, but actually sort of modifying the environment around them and then they sort of incrementally build upon other people's contributions behind them. So you get incremental contributions in a shared information environment are the ways that work is coordinated. It's not through sort of direct communication, so how do we sort of -- so my assertion is that we would see sort of some processes of stigmergy happening in these sociotechnical systems. It's not sort of direct interactions that we want to look for, but rather sort of the processes of modifying each other's work is what is sort of the operative relationship that we should be examining here. To do that I sort of want to go over what an event log is and what event logs are in a variety of different sorts of systems here. In Wikipedia if you go to any sort of Wikipedia article you can go and click on the revision history and you'll get sort of a segment that will look like this. You've got a list of all the different contributions that editors made. In this case you've got editors who are IP addresses or bots or a person right here with the timestamp and they're adding content. They're removing content. You can compare the differences between that and the previous version. We could do that or something like GitHub where there's also event logs that sort of document all the changes that have been made to a particular code repository. Again, you've got a person making a change that you can look at sort of the difference between that with a timestamp and for this particular kind of action, like an edit or a commit. 
On Reddit you've also got people who are participating and creating a comment thread, a coherent thread that lets us see a whole narrative emerge of people interacting and producing one kind of knowledge artifact, in that the whole thread is something that you can read through. But again, they are modifying the thread as the previous person left it, with a timestamp and so forth. Or even something like Scratch, where you can go through and see people's history of the changes they've been making to the system as a whole, if they've been loving or favoriting or liking something. So in all of these event logs we have these activities that people take: they make a commit, they make a revision, or they make a comment within a system, and there's a record of that. There's a case that they are doing this to. It could be a repository or a particular code base. It could be a Wikipedia article. It could be a discussion thread on Reddit or something like that. We've got a record of a performer, who is actually doing this. It could be a user on GitHub. It could be an editor on Wikipedia. It could be a member of an online community. It could be a human editor. It could be a bot, an automated entity. And then we also have an order. These things happen in a sequence, and this is really crucial, and I'll explain why in a second. You can basically sort these records based on time, or on a position or depth in a thread, and this time order is going to become very interesting. If you do sort it, whether you take the Wikipedia history log here or GitHub, you've got people making contributions to a specific artifact, a Wikipedia article for example. Editor A made a contribution to article X at time one. At time two, and so on, we see this event log emerge.
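The four ingredients named above (activity, case, performer, order) can be sketched as a small record type. This is only an illustration under stated assumptions: the `Event` class, its field names, the `trajectory` helper, and the toy log are all hypothetical and do not come from the talk or from any Wikipedia, GitHub, or Reddit API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One row of an event log (hypothetical field names)."""
    order: int       # position in the sequence: a timestamp or thread depth
    performer: str   # who acted: an editor, a user, a bot
    case: str        # what was acted on: an article, a repository, a thread
    activity: str    # what was done: an edit, a commit, a comment


# A toy, deliberately unsorted log for one article "X".
raw_log = [
    Event(3, "C", "X", "edit"),
    Event(1, "A", "X", "edit"),
    Event(2, "B", "X", "edit"),
    Event(4, "A", "X", "edit"),
]


def trajectory(log, case):
    """Return the performers who touched one case, in event order."""
    events = sorted((e for e in log if e.case == case), key=lambda e: e.order)
    return [e.performer for e in events]


sequence = trajectory(raw_log, "X")
print(sequence)  # ['A', 'B', 'C', 'A']
```

Sorting by `order` is the step that matters: everything later in the talk is derived from who comes immediately after whom in this sequence.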
It's a general, abstract way that we can look at an event log, but I'll explain why this kind of temporal ordering is important in a second here. And this is what the sociotechnical trajectory is: at its core it's looking at these relationships within these event logs. Before I get to that, I want to talk about the scholars who have looked at these sociotechnical systems before. They very much focused on these questions of self-organization and hierarchy that I'm also interested in, in open source systems and so on. There's a huge literature on this, but the relationship that they often look at is this relationship: who works on what articles? You've got a node who's a user and another node that's an artifact, so user A might have a relationship with artifact X because user A is making a contribution or an edit or a commit or something to artifact X. So then you say, well, we can mine that event log. We can take all these different relationships and we can create some beautiful picture like this. Here you've got all these red nodes, which are the articles in Wikipedia, and the blue nodes are the editors who are editing them, and you see that, oh, there is all of this complexity and heterogeneity, and there's some clustering down here, and there are some other ones that are kind of isolated but are all connected. And you can do all these really interesting network analyses. But what if you wanted to just zoom in and look at what the collaboration looked like on one article? You can say, great, we can do it across a whole bunch of articles. That's really interesting. But what does it look like on one article? So if you go back to the article trajectory, this individual history of contributions, the way that we do that traditionally is we would say okay.
At time one, editor A made a contribution to article X, and at time two B came along and made a contribution. At time three C came along and made a contribution. At time four A came back and made another contribution, so maybe we make that line a little bit thicker, but the ordering is unclear. We can't discern what the ordering is in the system. Maybe D comes and makes another contribution, and then maybe A makes the sixth contribution there at the end. In the end we see the emergence of a structure, but if we're just looking at how a single Wikipedia article or a single GitHub repo is edited and we use this method, we get networks that look like this, where you've got the Wikipedia article or something at the center and everyone who edited it around it. And there's no heterogeneity. There's nothing interesting going on in terms of who is editing what or who's working with whom, right? It's just this simple kind of story. So instead we draw on this other set of literature about narrative networks and sequence analysis and process data, where we're looking at the sequencing of events, the interactions of someone modifying something for someone else, these processes of things unfolding and taking turns. There's this rich literature on sequences and cycles and processes, right? So we instead look to these things and ask a different kind of question: who modified whose work? It's not that A is editing article X, but on article X, who's working with whom? In particular we look at the sequence of events: user B modifies user A's version, user B is editing immediately after user A, so by implication user B is modifying user A's state of the article. So those relationships begin to look like this, where you have Apteva modifying Kennvido.
He made a change at a particular point in time. He may not have been editing this particular set of phrases, but when he last touched the article, this is what the article looked like. This guy comes through and makes a quick change: the stuff used to be all caps, and so he's going to go through and say, actually, let's do it more traditionally, how we would write a headline. So the relationship here is that Apteva is modifying Kennvido. Kennvido owned the article for a brief point in time, that's what the article looked like, and this editor comes along and modifies that state of the article, modifying what Kennvido had done before. And if you look at GitHub you can also see that. We've got people removing some code and adding other code like this, and so there's a previous person who owned this artifact for however brief a time, and some other person comes along and modifies that state of it. If we look at that in the abstract trajectory approach, the relationship all of a sudden isn't that A is editing X; it's that B is following A, and C is following B, and then A is following C, D is following A, and A is following D, right? This is the operative sequence relationship that we look at. You've also got the same editors or the same users appearing over and over again, so we can collapse this down to a graph where each user appears only once. A only appears once, and then B modifies A's version, and C modifies B's version, and then A modifies C's version, and we don't add another A; it goes back to A. We create this cycle all of a sudden. The narrative comes out of the fact that we're not modeling each point in time at each individual edit, but that an editor owns a particular edit, and the article can come back to you or go away from you, but you are represented in the same way over time.
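The collapse from an ordered sequence of editors into a who-modified-whom graph can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the function name `modification_edges` is hypothetical, and skipping consecutive edits by the same editor is an assumption the talk does not address.

```python
from collections import Counter


def modification_edges(sequence):
    """Count directed edges (modifier, modified) between successive editors.

    Each editor is a single node, so a returning editor closes a cycle
    instead of creating a new node.
    """
    edges = Counter()
    for previous, current in zip(sequence, sequence[1:]):
        if previous != current:  # assumption: ignore an editor following themselves
            edges[(current, previous)] += 1
    return edges


# The toy example from the talk: A, then B, C, A again, D, A again.
sequence = ["A", "B", "C", "A", "D", "A"]
edges = modification_edges(sequence)
nodes = {editor for edge in edges for editor in edge}

print(sorted(edges))
# [('A', 'C'), ('A', 'D'), ('B', 'A'), ('C', 'B'), ('D', 'A')]
```

The same six edits drawn as editor-to-article ties would give a star with four spokes. Here they give five directed ties among four editors, including the cycle A, B, C, back to A.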
And so you have a cycle here where it came back to A. A just decided to jump back into the collaboration. He wasn't happy with what C had done, or he was happy and wanted to keep building on it. Then D comes along and wants to modify A's version, and A still doesn't like that. What we begin to see out of this sequence all of a sudden is a much richer and more heterogeneous picture of this collaboration, where it's not the simple star anymore. Even from this toy example, you've got a much more complex network emerging. And then we can actually apply this method to an article on Wikipedia, in particular a breaking news article, which is something I'm very interested in. I've found that these collaborations on Wikipedia articles about breaking news events are different from other types of collaborations on Wikipedia. If we look at an individual article and we use this method, we all of a sudden have a network that looks like this. Sure, it's a hairball, but it's also a much more complex and structured and heterogeneous - I can't pronounce that word today for some reason. All of a sudden we have this network, and this is an article about the 2011 Japanese earthquake and tsunami. The nodes here, again, are the editors as before, and the relationships are whether one editor modified a previous editor's version of that article. The nodes here are sized: how big they are is how many other editors' work they've modified. But we can begin to overlay some more information on this. We're going to dive a little bit deeper into this Japanese earthquake article. So this is the same image as before, and I've only added the coloring of the nodes, but this isn't any new information that we didn't already have in the event log. This isn't a new independent observation.
What it is is just that we're going to color the nodes a bluer color if they edited the article earlier and we're going to color the nodes in a gradient towards sort of redder color, if they joined the collaboration and they started editing later. It becomes really apparent and all of a sudden at the very early stages this article, these blue editors working together when the article is very young. They're modifying each other's work in the first sort of hours or minutes after this event occurs you see this very sort of dense pattern of work where they are sort of passing the article around to each other and it's this very sort of dense pattern. You have this sort of this very prominent central person here is a blue editor. He or she joined this collaboration sort of very early, but then you see sort of these red editors out here and so these people have a very different sort of pattern, right? This is sort of like a hairball dense clustered highly sort of cooperative sort of work, but out here you've got these sorts of strings were things get passed down and so they look like these giant sort of orbits. And this is mainly just an artifact of the visualization sort of algorithm that I used, a kind of spring embedding. By implication what it means is that these editors -- this guy sort of makes a revision and this guy is modifying that guy’s version and this person is modifying that person's revision, but the fact that we have this sort of string where like they make a revision, but then like no one else, all these other people who made a revision who have contributed something to this article don't decide to jump in and modify this person’s version. They leave it to some other newcomer to come and modify that version. It says something really sort of deep about sort of the way this collaboration’s unfolding. 
These people have left the collaboration and instead we have sort of these new people coming in and modifying it but they can't be bothered to sort of continue in the collaboration at a later point in time because if they continue to collaborate they move to the center. They would be working with other people. They would be modifying other editors’ work as well. So we can sort of look at sort of different features of different things here. We have the first editors here, so these are, again, the first editors who made the earliest contributions to the articles, but they don't sort of persist. They worked together very early on, but like as new editors came in and more work was being done, they're not here in the center. These editors left, right? They edited at a very early stage and didn't continue to sort of persist. We've got sort of a core group of editors who are editing very intensively with each other. And this is another group of very early editors. Like I mentioned we've got these sort of peripheral editors who join much later and are working together in a very different sort of way. Then you've got this one sort of highly central person who is making contributions. And then you've also got some other people out here who are sort of highly central but in a different way because they are surrounded by different colors. Like this blue person is working with other blue nodes. They are working with other blue editors. They are working with other editors who joined relatively early. This editor is modifying the work of other editors who joined much later and, in fact, if you kind of zoom in and look at what he was doing, it's actually not a he or a she; it is a robot who is just modifying the work of vandals. 
These are people who joined very late. So he joined relatively early, hence this greenish color, but then the only work he does is modifying vandals' work: he is reverting the work of vandals, people who come by and blank the page or say, you know, that tsunami was awesome. He tries to undo that sort of work automatically. But we can also color the nodes a different way, and again, we're not introducing any new information. We know when the editors started editing. We know when the last observation of them editing was, and so we can ask, when did they stop editing? Again, these bluer colors would suggest that they joined early but they also stopped editing earlier. This person joined early as before, but this greenish color says that he stopped editing at a medium stage. But you also have these others here who were blue before but are now kind of red, and they are the core editors. These are people who joined very early and are still editing much later, until the present day. We begin to see the emerging social roles of people becoming custodians, trying to maintain these articles over time. We can do other sorts of layout algorithms here as well. There are other attributes that we can specify for these networks. Because we're looking at these event logs, we know how much time has elapsed between edits, and since that's the operative relationship that we are looking at, we can begin to use that as another attribute that we can bring in. So these relationships, that editor B modified editor A's work, mean that editor B edited after editor A, but we can also say how long after editor A editor B edited. We can put on a latency. We can also go in and say, what was the size of that difference? How much content were they adding or removing?
We can even look at sort of the features of the content. So here's the network as we had it before. If we wanted to have just a subset of the data and just look at what does this network look like when we just look at relationships that are happening very intensely, very quickly. People modifying each other's work in less than 60 seconds, then we again sort of unpack this sort of this backbone of work largely among the early editors. We also see it among this vandal fighter here has a very distinctive pattern of sort of quickly modifying other people's work. We also see that this very sort of central person all of a sudden doesn't have that many strong ties anymore but we do see sort of strong ties out here that these are sort of people that are editing very early on the article’s history but it's also where the most intense work was going on. People are kind of trying to work together very quickly and are modifying each other's work, you know, with less than 60 seconds going by. Then we can also look at this in a different way and say let's just look at the subset of the nodes, just like remove a bunch of nodes here, and only retain the nodes if those editors have ever added more than, who ever added more than 50 bytes, if they ever added a sentence worth of content. Again, these are all of the editors that have ever touched the page and ever made a change to the page, but if we ask the question like who's added more than like a sentence worth of content, all of a sudden this network becomes much more sparse. 
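The two subsets described here, ties formed in under 60 seconds and editors who ever added more than 50 bytes, can be sketched as simple filters over a revision list. The tuple layout `(editor, seconds, byte_delta)`, the function names, and the toy numbers are assumptions made for illustration; only the 60-second and 50-byte thresholds come from the talk.

```python
# Each revision: (editor, time in seconds, bytes added or removed).
# Toy data, already sorted by time.
revisions = [
    ("A", 0, 400),
    ("B", 30, 12),
    ("C", 75, -5),
    ("A", 100, 80),
    ("D", 4000, 3),
    ("A", 4020, -3),
]


def fast_edges(revisions, max_latency=60):
    """Keep (modifier, modified, latency) ties formed in under max_latency seconds."""
    kept = []
    for (prev_editor, prev_time, _), (editor, time, _) in zip(revisions, revisions[1:]):
        latency = time - prev_time
        if editor != prev_editor and latency < max_latency:
            kept.append((editor, prev_editor, latency))
    return kept


def substantial_editors(revisions, min_bytes=50):
    """Editors with at least one revision that added more than min_bytes."""
    return {editor for editor, _, delta in revisions if delta > min_bytes}


quick = fast_edges(revisions)
core = substantial_editors(revisions)
print(quick)  # [('B', 'A', 30), ('C', 'B', 45), ('A', 'C', 25), ('A', 'D', 20)]
print(core)   # {'A'}
```

In this toy log, D's edit arrives more than an hour after A's, so that tie drops out of the fast backbone, while A's quick follow-up on D stays in. Only A ever adds more than 50 bytes in a single revision.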
So the work that all these people are doing, that these other sort of thousand editors are doing is incredibly sort of micro sort of level work that they are changing particular sort of commas, or they're doing stuff that's less than 50 bytes of work, and when you look at just the work of people who have ever added more than 50 bytes of content to this article, you get a sort of very different picture and a lot of the very central people that we saw before don't appear there anymore. We still have two central people here. We're going to explore them in a bit. I point out those two sort of people at the bottom and so just like I did before is we dove in and we looked at the event log for a Wikipedia article, or we could do this for any sort of artifact sort of event log. We can do the same for editors. We can do the same for users. We can say, we can go in this case, this is my sort of edit history most recently, and so this is the history of all of the stuff that I've been doing most recently and we can go back and look at all 10,000 edits that I've made to Wikipedia. But we can again, go through and say there is sort of a relationship where I went from editing the article about John Riedl to editing the article about the Sawicki talk page and I did that after the social network, so I'm kind of jumping around to these different topics, but if we kind of aggregate that up and look at sort of the network of all of these sort of relationships, what does that look like? Again, if I am sort of performer X and instead of making changes to these different articles we can model it the exact same way analogously, that I started editing article A. Then I jump to article B, then C and then maybe I went back to A, and then I went to D and then I went to A. All of a sudden we get networks that look like this, and so again, these are very unusual, some of these features are sort of we don't usually see in networks. 
We're used to seeing these core-periphery structures, these complex hairballs and things like that, but when we look at these editors we see very different sorts of patterns. We've got this person who was that steward, that very strong central person up there. This user's name was Flodded. We can see where they started editing, and the initial edits that he ever made to Wikipedia were about the 2011 Tucson shooting. Again, these nodes are articles. They're not editors anymore. The relationship is whether they edited one article after another, so his earliest edits are to this other breaking news event. He's editing this article about this breaking news event, and then that event dies down and he starts editing articles about mobile phones, and so he's kind of off in the wilderness. He's off in the wilderness, literally editing articles about Antarctica. The Tucson shooting happened in January, in the first week, and the tsunami happened in April. This event happens in April, and this person all of a sudden lights back up, and he's really in the thick of editing this article about this earthquake and tsunami. The articles that he's editing here I've colored in different ways. The blue articles are the Wikipedia articles that we see, like the front page, the articles that 99 percent of us read and look at. The pink articles are things like the talk page, where you have discussions about what we should do or shouldn't do. We've also got these brown pages, which are more highly specialized pages about templates; they're infrastructure pages. Then you've also got these turquoise pages; the pinker ones are user talk pages, so he's talking to other users about what should be done. And then this light blue article is actually the talk page for that article. He's jumping around.
He's talking to other users about what they should be doing. He's jumping to the article about to the talk page to try to get other people to work in certain way. He's jumping to this other article about how do we sort of like get the number of deaths out and so on. This person is engaged in this sort of very interesting kind of pattern of behavior that he did some breaking news stuff and then he didn't and then didn't and then all of a sudden became very active and then did nothing else after this. Then we can look at other editors too. There were these two other editors that we saw down here. So if one of them is Sandpiper then we do the same type of method. We look at his history of what he's done before. We get this network where he started off editing articles about Harry Potter and then he jumps up and starts editing articles about the nuclear disaster and the tsunami and then he starts editing articles about Cutty Sark which is a British ship and also a kind of bad scotch. What's interesting is that the work that he was doing here on Harry Potter doesn't seem to qualify him at all to edit articles about nuclear disasters or tsunamis or things like that. Harry Potter is on the surface has nothing really to do with a breaking news event, but at the same time Harry Potter is something that's always in the news, right? It's like people are always kind of coming by and saying you should put the spoilers in there. This article is, you know, should have more stuff on Hermione but not as much about Harry, whatever, so people want to use it as a fan site or something like that. The work that he does on the Harry Potter site is like being the dispute, sort of arbitration stuff. You've got all of these light turquoise, blue articles so he's on the talk page saying that guys this is how stuff should be done, like we should chill out. This is the rules. 
This is how things should get done. He brings that expertise about how to handle something that's always in the news and very prominent, and starts applying it on the nuclear disaster and tsunami pages. So he is able to migrate that expertise from one area of Wikipedia to another, and the flexibility and the ability to do that is really interesting. This sociotechnical trajectory approach is able to illustrate that these things are highly clustered and very distinct for some editors, but they still make these transitions. Then you've got this other editor whose trajectory is much more incoherent and more of the hairball that we would usually see. It doesn't have the distinct structure as before, but we still see that there's some structure here that says, here is the stuff that he was doing on the Japanese articles, and a lot of his work before was on environmental organizations and international trade agreements. Again, you see this migration of expertise, where he is very much an expert on nuclear policy and international trade agreements about nuclear issues and is bringing that to the nuclear disaster as well. I mentioned before that this blue color was very important, and this is my favorite case here: you've got this one editor who only edits. He doesn't do any of this background coordination stuff. He just wants to edit the content of articles. That's all he cares about. After the tsunami you have all these towns and cities that were wiped out, and so he's editing all those articles.
All the work that he did before was about like Japan and Japanese pop bands and Japanese serial killers and again, nothing would seem to qualify an editor who writes about Japanese serial killers and Japanese pop bands to edit an article about a huge national disaster like this, but for the fact that he obviously is someone who lives in Japan and has a lot of sort of cultural familiarity with, you know, what are the towns and what are the landmarks that we need to update about this, and so we see again him migrating from this early sort of work that he was doing before down to migrating the work on these affected towns and landmarks. There are ways that we could lay this out again to show that not only did he go from editing this stuff to specializing in this for a while, but that he actually went back to doing what he was doing before. I haven't shown that here. So this is just a single case study about the editors and the single sort of set of articles around this Japanese earthquake and tsunami. But if we begin to sort of look up at another sort of level we get a whole class of articles. In this case we will look at a class of articles about airline disasters. If we do this sort of method that I described before, if we look there is these differences where there's airline crash articles and Wikipedia if we look at the article trajectories or artifact trajectories like which editors are modifying which other editors’ work, we can kind of see by visual inspection there's two very distinct classes of networks that come out of this. At the top we have sort of this characteristics of this is what networks often look like, what the core periphery is sort of like the hairball. 
We've got kind of hairball networks on top, but the bottom these are also networks but we've never seen networks like this in sort of nature or any sort of social system like this, but these are also networks that we see in Wikipedia and the distinction between these articles is that these articles on top were authored as breaking news events. The event happened and within like a few minutes of the event happening people were jumping in and trying to edit this article. These were articles about things that happened years and years had passed, so that, you know, this airline crash happened in the mid-1990s, both of them, so obviously there's no Wikipedia around so ten years went by before anyone got around to writing these articles. As a result we see that the structure of these collaborations are very different, the way that we produce the knowledge about these, they're the same sort of content. These are both airline crash articles, but the way that this knowledge was produced it was produced in very different ways and we see those structures very sort of clearly. We can measure that with a variety of different sorts of like network measures that we can say we want to sort of measure this hairballness of the networks. We want to say the one on the left is more of a hairball than the one on the right, so we can operationalize that in at least four different ways. There's many other ways we could do this as well. These are sort of very easy to compute and easy to sort of understand methods. We say that short diameter is like how wide is the graph. How many steps would you have to go between sort of different hops to say from one node to the other at the longest, so the shortest longest path. The one on the left should have a shorter diameter. These editors can just jump from one to the other. It would be relatively easy to get from one to the other if you computed that statistic versus this person you have to jump through every single editor. 
These are extremes, obviously, but they are illustrative because this one has a very short one and the one on the right would have a very long diameter. Closeness centrality would be a measure of how close nodes are to each other. Can they sort of reach many other nodes in a few steps? And if we measure that and we average that across all of the nodes in the graph, you know, this one on the left the editors are sort of more close to other editors in general. The one on the right, these editors are relatively far apart on average from other editors and so the one on the left we expect there to be more closeness. The one on the right there would be less closeness. Betweenness, there is this sort of interesting idea like brokering that you kind of connect editors who would otherwise be unconnected, but here we can think about betweenness not as sort of this classic idea like brokerage, but instead as like a measure of fragility, that if you like took out that node, like would the graph suddenly become disconnected in an average sort of sense. The graph on the left, you know, there's sort of overall low betweenness because there's no sort of single central node that is like a point of failure versus the graph on the right almost every single node is a point of failure. To get to any one of them, they all have very high betweenness centrality. Then finally clustering, this one is more clustered, more of its editors work with other editors who themselves have edited with each other. The one on the right no one is editing with anyone else except the person immediately before them. If we do a statistical model and we ask are these sort of breaking news articles different in aggregate, if we look across this whole set of articles about airline disasters, we do see that depending on how long it takes for that article to be created, we do see that there are shorter diameters. There is more closeness.
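The four measures just described (diameter, closeness, betweenness, clustering) can be sketched on the two extreme cases, a fully connected "hairball" and a chain. This is a minimal standard-library sketch that assumes small, connected, undirected graphs stored as adjacency sets; the six-editor graphs and all function names are illustrative assumptions, and the talk does not say which software or normalizations were actually used.

```python
from collections import deque
from itertools import combinations

def hops(adj, source):
    """Breadth-first search: hop distance from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter(adj):
    """Longest shortest path between any two nodes."""
    return max(max(hops(adj, s).values()) for s in adj)

def mean_closeness(adj):
    """Average of (n - 1) / (sum of distances to all other nodes)."""
    n = len(adj)
    return sum((n - 1) / sum(hops(adj, s).values()) for s in adj) / n

def mean_clustering(adj):
    """Average share of a node's neighbour pairs that are themselves tied."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += 2 * links / (k * (k - 1))
    return total / len(adj)

def mean_betweenness(adj):
    """Average normalized betweenness via Brandes' algorithm (unweighted)."""
    n = len(adj)
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair is counted from both ends, so divide by (n-1)(n-2).
    return sum(score.values()) / ((n - 1) * (n - 2)) / n

editors = list("ABCDEF")
hairball = {v: {w for w in editors if w != v} for v in editors}
chain = {v: set() for v in editors}
for a, b in zip(editors, editors[1:]):
    chain[a].add(b)
    chain[b].add(a)

for name, g in (("hairball", hairball), ("chain", chain)):
    print(name, diameter(g), round(mean_closeness(g), 3),
          round(mean_betweenness(g), 3), round(mean_clustering(g), 3))
```

On these toy graphs the hairball has the shorter diameter, higher closeness, lower betweenness, and higher clustering, which is the direction of each difference described in the talk.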
There is less betweenness and more clustering, so across all four features we find evidence that, in fact, these breaking news articles have more of this hairballness pattern than non-breaking articles, right? So these breaking news articles are, in fact, structurally different from non-breaking news articles and that's largely a function of how much time it takes for those articles to be created. Then we also have this idea, these trajectories also capture something about the temporality of these collaborations. If we look at some of these editors joined and start editing at a particular point in time, the collaboration has a characteristic sort of pattern. In the early stages maybe it's highly clustered and looks something like this, where again, these are editors who are working with each other at a various early points in time. But if we look at sort of the collaboration much, much later, this again is the Japanese example that has perhaps a very different sort of structure. But even though it's the same article, so the way that they are collaborating on these articles may itself be changing over time. That is as time goes on they may look sort of highly clustered and hairbally very early on, but later on it may, in fact, they revert back to sort of like how nonbreaking articles are edited. So we can, again, sort of estimate a model down here and say that is it the case that the way that -- the structure of these collaborations at a later point in time if we just subset that data, do those collaborations later on look very different from those collaborations sort of very early on. In fact, we do. We see that they sort of regress to the mean, that they begin as time goes on these breaking news articles, this clusteringness of them only occurs in the immediate aftermath in days zero through seven, but if we look sort of later on these articles begin to get longer diameters. 
They begin to have less closeness, they have more betweenness and they have less clustering, so here we again find evidence, by using the sort of trajectory method, that this way of collaborating is very sort of unique to the immediate aftermath of these events, that we see that sort of effect begins to go away as time goes on and they begin to look more and more like normal collaborations. The differences actually do not persist over time. This last case I want to talk about here is we've talked about a single case about an article and some of the editors who contributed to it. Then we looked at sort of this class of like all of these articles about airline disasters, but if we begin to look at like a set of like 3000 articles about hurricanes, if we look at articles about airline crashes, if we look at articles about like accidents, industrial disasters and wildfires and health outbreaks. We look to really broaden the scope to include all these different kinds of articles on Wikipedia, we can ask some other sorts of questions. We can begin to imagine like if we layered these trajectories one on top of another, so before we were only looking at trajectories in like within single articles, so within the Japanese earthquake disaster article, for example, right? But what if we said well there's a set of editors who edited this article. Well maybe some of them also edited articles about airline crashes. We can like layer these sort of trajectories one on top of each other and like look at this union or the intersection of these and what does that sort of trajectory look like? And so we can say that on one article maybe editor A works with and modifies the work of editor B. On article two editor A modifies the work of editor B. But these relationships on article one and article two both occur on the breaking news articles, so we can say these are the same editors.
They actually have two different kinds of relationships and we can collapse those two different kinds of relationships into a single relationship where there's a way. They've work together on two different articles, but on those two articles both of them, or 100 percent of that work was done on breaking news articles, so we give it kind of like a redder color. So we can color these edges by different sorts of colors and say maybe the bluer ones are non-breaking articles so they work together on a breaking news article and a non-breaking article and so they still work together on two articles. We'll give it a different color because it's a mix of different types of articles here and so if we again run this across all 3000 articles, we get a nice picture like this. Again, these kind of gray nodes here are the editors and these are editors who interacted across two or more articles and so this is sort of the zoomed in giant component of this. The redder colors that connect some editors to other editors means that they interact a lot on breaking news articles and if they have a blue color that means they interact exclusively on these non-breaking articles, right? And then we've got sort of these colors that are light green which means they sort of interact on a mix of like breaking and non-breaking articles. And so we see there is, in fact, a strong sort of core of editors here who edit across many different breaking news articles. They worked exclusively together on breaking news articles given this sort of red color and then there's other editors who appear to work together repeatedly but on non-breaking news articles and so on. But this cluster down here is sort of really interesting and illustrative. So these are editors, right, who are working together. They, we've got records of them modifying each other's work across so many different kinds of articles, but it's extremely dense. It's almost as dense as we would see up there. 
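The collapsing and coloring step described above can be sketched as follows: ties between the same pair of editors on different articles are merged into one tie, which is then colored by the share of those articles that were breaking news. This is a sketch under stated assumptions, not the speaker's code: the input format, the function names `collapse_ties` and `edge_color`, and the three-way red/green/blue cutoff are hypothetical simplifications of the color scale described in the talk.

```python
from collections import defaultdict

def collapse_ties(article_ties, breaking, min_articles=2):
    """Merge per-article editor ties into one tie per editor pair.

    article_ties: {article: list of (editor_a, editor_b)} (hypothetical format)
    breaking: set of article names that were written as breaking news
    Keeps only pairs that interacted on at least `min_articles` articles and
    returns {pair: (articles_shared, share_on_breaking_articles)}.
    """
    shared = defaultdict(set)
    for article, ties in article_ties.items():
        for a, b in ties:
            shared[frozenset((a, b))].add(article)
    collapsed = {}
    for pair, articles in shared.items():
        if len(articles) >= min_articles:
            share = len(articles & breaking) / len(articles)
            collapsed[pair] = (len(articles), share)
    return collapsed

def edge_color(share):
    """Red: only breaking; blue: only non-breaking; green: a mix."""
    if share == 1.0:
        return "red"
    if share == 0.0:
        return "blue"
    return "green"

# Hypothetical toy data: two breaking and two non-breaking articles.
article_ties = {
    "quake": [("A", "B"), ("C", "D"), ("A", "D")],
    "crash": [("A", "B"), ("B", "C")],
    "old_storm": [("C", "D"), ("E", "F")],
    "old_crash": [("E", "F")],
}
breaking = {"quake", "crash"}

edges = collapse_ties(article_ties, breaking)
colors = {tuple(sorted(pair)): edge_color(share)
          for pair, (count, share) in edges.items()}
print(colors)
```

Pairs that met on only one article (A and D, B and C here) drop out, matching the description of editors who interacted across two or more articles.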
But it's also so removed, that they are not actually working with those editors or modifying those editors work; they are modifying each other's work across many different articles but not necessarily modifying these other editors’ work. They kind of just work by themselves and they are relatively isolated, but not completely. But it's also this interesting color, right? They are not working only on breaking. They are not working only on non-breaking or there are some who work only on breaking, but there's a lot here who work on a mix, almost a perfect mix of breaking and non-breaking. So these are editors who added stuff about tropical cyclones. So this is like a really interesting sub community on Wikipedia where Wikipedia has these wiki projects where people sort of get really excited about editing articles about military history, roads and bridges, and so there's a group of folks who just like to edit or create articles about every single hurricane or tropical cyclone that's ever existed. So they write this in a very formalized like standard way that this storm developed as a squall off the coast of Africa and it moved off and became like a tropical depression, with this kind of barometric sort of characteristics and so on and so they have just like this sort of like almost like perfect way of just writing these articles, just copying and pasting them almost, and they are filling in like what changed for these articles. And so what we see is that these people are editing. They kind of show up in my breaking news corpus because obviously there's breaking news articles. These hurricanes are breaking news articles often but these people are also working together on, because they care about hurricanes. 
They don't care about breaking news articles, so they are also editing these other articles about, you know, hurricanes that happened 50 years ago, but they are working together and so what this method has really uncovered is by identifying these trajectories overlapping in this way we've highlighted like a community of practice and this is what we would call a community of practice in the very sort of traditional way. This is people who are like kind of jointly attending to the same sort of topic and there's ways to join the community and distinct boundaries, that they are not sort of part of the rest of the community. They are engaged in a particular type of work that they care about that is differentiated and so on. And so this trajectories approach that I sort of propose and I think that I've tried to demonstrate here, we've gone from this very abstract idea to sort of these event logs where sort of we can see that someone edited after someone else and we've arrived at this place where all of a sudden we can begin to sort of uncover these sort of really complex structures and these sort of anomalies in this data by using this method and so I just want to wrap up here with a discussion. So these event logs are ways to sort of narrate changes to artifacts, Wikipedia articles [indiscernible] repositories and users over time. These are the history of someone becoming an identity or an article sort of becoming more complex or something like that, right? We can go through and look at previous points in time at what these articles, what these artifacts, what these users looked like and were doing at previous points in time. These event logs are pervasive through lots of sociotechnical systems. We see them in GitHub. We see them in Wikipedia. We see them in Scratch. We see them in Reddit and so on. We can go through and find these data almost everywhere and yet we don't sort of realize the potential that we could begin to build relationships out of these, right?
That if we look at these sort of temporal adjacencies, the fact that I work immediately after Scott who works immediately after Andres, all of a sudden we can begin to sort of say or do some interesting things about creating these trajectories that have these really sort of complex and interesting behavior, right? They collapse this sort of very large-scale data into these sort of nice sort of graphical representation that we can then sort of do traditional sort of social network analysis things on, right? And that they captured this idea this Stigmergy. We've got this accumulation, this gradual accumulation of effort of individual contributions create these sort of very emerging sort of complex structures. Again, we're not sort of looking at sort of the content level features that we were working on the same sentence or anything like that. It's just the fact that I edit after Andres or something like that, but by aggregating these sorts of interactions altogether we could be sort of very interesting and very complex structures that are nontrivial and they reveal these sort of, the patterns within these trajectories begin to highlight sort of the social and the temporal context of work, right? That it's, we see that in the community practice with the sort of, the hurricane editors. We see that they are sort of more active editors are not sort of deeply involved in -- they've joined the article very early, right? They had the blue editors. They joined very early and they stopped editing, right? There are just sort of this interesting sort of behaviors that we capture and this context of work that people were involved and motivated to work very early but weren't, in fact, motivated to sort of continue to work going forward. 
And so I think what's most sort of powerful here and this is probably like the hardest one to read, but that trajectories also expose sort of regularities and anomalies for sort of follow-up qualitative analysis where we see like sort of the regular pattern how these networks behave, but then we see these anomalies, that there's people who are like highly central. Like what does that mean? Like those editors are coming from somewhere. They had some work that they had done before. They had some work that they were doing afterwards as well, so we only capture that in the single snapshot, but then we can begin to sort of unpack that in a much more sort of rich way, where we see these anomalies like this sort of group of editors who were working together but weren't necessarily working strongly with a lot of other people on all these different breaking news articles. That's an incentive to kind of dive in there and look deeply at what's going on in there, and so we are able to unpack from these trajectories these sort of interesting anomalous sort of structural features that we can then begin to sort of use to then motivate sort of follow-up qualitative analysis and so I think this approach, the sociotechnical approach that I've discussed here is an interesting sort of like mixed method sort of thing, right? That we can both use it to sort of data mine the extremely large-scale sort of databases that we have that are pervasive and all these different event logs that we have, but we can use that to sort of collapse that down to something that is relatively easier to comprehend and make a network out of that. We can begin to sort of run the network analyses on that and then we can begin to see that there are the anomalies that then we can use to motivate some interesting qualitative analysis and sort of tie these things all together and really understand the complexity of these systems. With that, I'll wrap up and I thank you guys for your attention.
I'll take questions. [applause] >>: I've got a conceptual question. >> Brian Keegan: Sure. >>: You are sort of framing this whole entire thing as collaboration between people, but then the relationships we're actually looking at are proximity in time, so like you might edit an article that I previously edited and then I might never go back to it. Are we really collaborating on something? It seems to me that this is more about like participation, so like what distinguishes breaking from non-breaking articles is the patterns of participation, not necessarily collaboration. >> Brian Keegan: Right. So there's probably imprecision in like the language I use and I think that we can sort of split the hairs here about what's sort of coordination, collaboration and all of these sorts of things, right? But it's important to kind of separate all those things out, absolutely. One thing I think you brought up though was the fact that these relationships are actually implicit. Like in social network analysis we often assume that like we make an active choice to e-mail someone or to talk to someone, like that's sort of agentic, like I decided to talk to you but not someone else, but in fact, what this method is uncovering are these like more implicit or like unassumed ties. Like I might be modifying an article and not know that it was your version of the article, and so that’s maybe just pure coincidence. Maybe that's all we're capturing, but if it was sort of really all pure coincidence then you would just sort of see these random graphs where nothing is sort of happening, but what we're actually seeing is something very different where we see that there's these patterns of activity, like where you are engaging and then I'm engaging and the fact that you don't engage anymore says something about how this collaboration’s working, that you chose not to engage anymore.
You are not sort of paying attention to this thing that, this isn't something that you care about potentially. It could be that it captures something about do we pay attention to each other’s sort of work as well. And so I think that in the sort of very micro level, like a lot of this may just be sort of coincidental sort of ties, right, but I think that by sort of, there's lots of data here. You begin to mine that up and you would begin to see sort of that begin to wash away and so there might be like a baseline level of like randomness in this sort of approach, but we begin to see these sort of structures emerge that suggests there's something very much non-random and non-coincidental about these ties. >>: Yeah, so my second question. Have you, you start out very motivated [indiscernible] sequence analysis kinds of things. Have you done anything like what's the whole set of network [indiscernible] like [indiscernible] events? >> Brian Keegan: I haven't done anything like that, so that's kind of like the next approach here. So having just on this admittedly just purely descriptive sort of work here and saying like you can create these kinds of networks using this sort of approach may be very complex features, but all of a sudden we've got networks now. That looked like networks that we often see in other sorts of contexts, so now let's begin to sort of bring in these other kind of methodologies that we have in social network analyses, do some statistical modeling that we can look at what's the likelihood that some people creating ties with each other and so on. How do these things sort of evolve and can we create sort of like Markov chains like people making decisions about when they sort of jump in there and so on, yeah. >>: I think that perspective could be really interesting because like the sequence of ties is explicit in that [indiscernible]. 
That's where you are kind of obscuring the timing of the edges, not really, because like forward past or forward time, right? They are all kind of in one thing together versus not the kind of framework where you think about like -- yes, a network but sudden nodes where edges occur discretely, like, over time. >> Brian Keegan: Right. So these networks are sort of unique. I don't have like a graph theoretical like mathematical background to explain exactly what the right word would be, but like they are unique because you can actually follow a path on that network which is the editing sequence. You can like walk this graph and show like how sort of who has touched it when, right? And so that creates some biases maybe in terms of what kind of ties can or can't exist, right? But there's, I'm thinking people will want to fight through how to model this. That's exciting. >>: That's a whole other thing. >>: I totally dig the approach of getting down into sort of the action in the network to sort of bridge the Barabási-Benkler gap. Very cool. And also, sort of apply [indiscernible] and it's hard to kind of scale that up, right, because you are focused on sort of individual level actions. It's cool you were able to scale it up to multiple thousands of events. The question then is just coming back to the, you know, how does it relate to Benkler’s like what's important to society and so I'm just curious for your thoughts on what do you think are the tightest operationalizations of this work to this sort of Benkler asking what's important for society? Like is it about we can understand how different networks in terms of this sort of activity, the editors, enable societal level self reflection, or efficiency or sort of information spread or, I mean what are the, I don't know. Does that make sense?
>> Brian Keegan: One thing that was motivating and maybe it's that roundabout way of getting to that was that so Benkler talks about these sort of forms like networks like knowledge production, but when we talk about like a single artifact like how that's produced, like we have no way of kind of unpacking that, so I think that this is one way to begin to see that, in fact, different artifacts are produced in different sorts of ways and so this complicates the idea that like we just call it a network. That's all of a sudden like we can just use that as a framework to understand like these things are all different, that actually within that sort of network form of knowledge production are actually sort of really interesting differences in how those things are done. And so then we want to get in sort of the societal level questions and it's like you can begin to get a question like sort of efficiency, like are there some ways, you know, was the work that's being done that's highly clustered here really redundant? And maybe it is because we saw that not a lot of like effort -- there was only like a minority of people who were doing most of the heavy lifting, were actually making the biggest changes to these articles. In fact, the work there was sort of a capture of that modularity and granularity that he talks about a lot, right? But to get to that question of sort of those outcomes that we care about, I mean it might bear on sort of how would we go about designing systems or designing kind of feedback mechanisms that sort of reward people for assuming certain kinds of social roles in these sorts of collaborations to incentivize them to sort of assume roles that make them more sort of coordinated or whatever interactive [indiscernible] different kinds of editors are highlighting the fact that they are engaged in a particular way of collaborating and not, or maybe that is problematic or it's not in that. 
And so getting back to the sort of Benkler question that, you know, we usually provide that feedback like monetarily or something like that, but here we begin to use like mine the information, mine your history to say something about and valorize that in a certain way that you are engaged in a particular pattern of collaboration that we know that that leads to these particular outcomes that may be I differentiated on breaking and nonbreaking, but maybe some patterns of collaboration lead to very high quality articles or lowquality articles and we want to replicate that. Maybe that's just a feature of how we generate high-quality knowledge artifacts, that we see maybe this hairballing is like why some collaborations are successful or not because we are part of a community that I know that Andres is looking at what I'm doing and I'm looking at what you are doing and so we feel like the work that I am contributing isn't kind of going like off into the void, but like other people are going to sit down and modify it as well. So in all those sorts of ways I think that we can, this framework suggests ways that we can both study different ways of like how valuable sort of these artifacts could be to get that outcome question, that there's differences that maybe these structures might predict different kinds of outcomes, maybe predict different sort of quality of these artifacts. >>: [indiscernible] one of the focuses of his work is to come up with like [indiscernible] patterns of [indiscernible] systems [indiscernible] statistical. From the specific cases that you looked at on breaking news, what are the kinds of take-home message is that you give to say if Wikipedia were to be reassigned to just focus on breaking news. [indiscernible] like what's not [indiscernible] why are they most important [indiscernible] you pulled from some of the work that you were doing on breaking news? >> Brian Keegan: Sure. 
So I think that the examples of the editors that we looked at there are the people that are editing articles about Japan or the editors about Harry Potter, I think that sort of traditional organizations, the stuff that, you know, Benkler sort of critiques, the traditional markets or hierarchies, would say that you are not allowed to do that. You shouldn't be able to translate that expertise to someone else, somewhere else. You have to go through an administrator. You need accreditation or you need someone to say that it's okay for you to go over and edit these kinds of articles. And so in this network kind of way, we could begin to see the sort of the flows of the expertise in these new domains that you wouldn't predict would be it. So what's hard then is like you might be able to sort of create a recommender system to say that oh, it looks like you like editing talk pages a lot and it seems that you are a very good sort of discussant or sort of arbitrator or something like that. We are having problems here. Would you mind coming over here and like helping out with this discussion thing? And so maybe a design lever around sort of how do we match sort of tasks to people better by looking at these sort of patterns of how they worked before, the kind of work that they engaged in before? So we could begin to maybe do some sort of matching like that. Does that cover that, yeah? >> Andres Monroy-Hernandez: Any other questions? No. Well, thank you very much. >> Brian Keegan: Thank you guys. [applause]