>> Andres Monroy-Hernandez: Welcome everyone. Today we have Brian Keegan. He's a data
scientist or a network scientist. He's a post-doctoral researcher at Northeastern University with
David Lazer and he got his PhD at Northwestern University. So he's going to talk about
narrating with data today.
>> Brian Keegan: Thank you Andres. Thank you all for being here. I'm excited to be here and
talk about some of my work, so this is actually the first time I'm giving this version of the talk
so hopefully you guys will not begrudge me when you see another version of this probably five
years from now as I refine it. This is the outline of my talk and we'll come back to this slide
more and more, but I want to sort of give you a little bit of context from where I come from in
network science and in particular type of network science that I do. I go into sort of the event
log data that I like to work with and how I turn these into sociotechnical trajectories, then I walk
through, go through three different cases at differing levels of abstraction, from
very fine to very macro levels of what we can do with this sociotechnical
trajectory approach and then we'll wrap up with the discussion. I differentiated on that first
slide between network science and network theory, and so when we think of
network science we usually think of someone like a professor Barabási here who has been very
instrumental in thinking about how do we sort of model networks and develop sorts of
statistical models and networks and they are very much concerned with how, examining the
similarities across different kinds of networks. If we look at biological or social information
networks we see very similar kinds of patterns in the topology and the structure of these
networks. They're scale free and they have various sort of skewed degree distributions like that
bottom image there, but these models that they often implement, coming from sort
of the physics side of things, usually assume that people are sort of protons or proteins or
particles. They don't really sort of differentiate between what it is that we are modeling, the
interaction that we're modeling here, but they're very sort of interesting and powerful models
because you specify sort of these microscopic models like how a single node behaves and how
they make sort of individual kinds of choices. You're able to generate these very complex sort
of macroscopic sort of processes. But this whole sort of domain of network science really
ignores a lot of the economic and political implications of what these network features mean.
What does it mean for our society or for our legal system or for how we allocate resources to
have something that is sort of skewed in these sorts of ways? They just come from the
perspective of: this is what it looks like and let's try to explain that. On the other side we have
something like Professor Benkler here who also talks about networks, but he talks about
networks in a very different way in which how do new sort of information technologies enable
us to sort of develop new forms of organization, new ways of knowledge production. He's very
much focused on these questions of sort of what the economic, legal and sort of political
implications of having a network political economy. What does it mean to have sort of a
Wikipedia or a Linux or something where people are able to sort of collaborate in large ways to
kind of create new forms of knowledge and new sorts of artifacts? I don't know if I'm allowed to
show the Linux penguin here at Microsoft. [laughter]. He also is sort of very concerned with
not these sort of microscopic questions about how do nodes make links to each other, but these
macroscopic sort of concerns, like what does it mean to have a polarized blogosphere, like this
image from Lada Adamic’s work back in 2005. At the same time, when he talks
about networks he doesn't talk about it in the same way that Barabási would talk about it
where you know we're looking at sort of the particular kinds of structures and how these
networks are similar or different, so Benkler sort of ignores these sorts of, what are the
network features that generate these kinds of particular Wikipedia or Linux. Are Wikipedia and
Linux similar or are the way that the people collaborate similar? What does that mean for them
to be similar or dissimilar? Beyond sort of the need that we need to reconcile the fact that we
got social scientists and physicists doing these different sorts of things, different methods, you
know, Benkler on one hand -- you know, they really don't speak to each other often about what
it is that these networks -- they both talk about networks, but they talk about networks in
various different ways and so Benkler talks about, he's really silent on what does it mean for
these differences of collaborators? How do they sort of form a cohesion? How do they
produce this knowledge? What do the structures look like? How are they different? Barabási
is really kind of quiet on the fact of what does it mean for the implications? What are the
outcomes of these network processes? What does it mean for these networks, what is it that
they kind of produce? What I'm going to talk about here is how we're going to sort of
understand the structural characteristics of self organization in a variety of ways and how we
can develop some methods to sort of compare how do we observe sort of self organization in
sort of these large-scale sociotechnical systems, in particular sort of this idea of stigmergy.
This is sort of a theory from biology where sort of there's actors in an environment and they
don't sort of -- they self organize and they coordinate their work, not by sort of directly
communicating with each other, but actually sort of modifying the environment around them
and then they sort of incrementally build upon other people's contributions behind them. So
you get incremental contributions in a shared information environment are the ways that work
is coordinated. It's not through sort of direct communication, so how do we sort of -- so my
assertion is that we would see sort of some processes of stigmergy happening in these
sociotechnical systems. It's not sort of direct interactions that we want to look for, but rather
sort of the processes of modifying each other's work is what is sort of the operative relationship
that we should be examining here. To do that I sort of want to go over what an event log is and
what event logs are in a variety of different sorts of systems here. In Wikipedia if you go to any
sort of Wikipedia article you can go and click on the revision history and you'll get sort of a
segment that will look like this. You've got a list of all the different contributions that editors
made. In this case you've got editors who are IP addresses or bots or a person right here with
the timestamp and they're adding content. They're removing content. You can compare the
differences between that and the previous version. We could do that or something like GitHub
where there's also event logs that sort of document all the changes that have been made to a
particular code repository. Again, you've got a person making a change that you can look at
sort of the difference between that with a timestamp and for this particular kind of action, like
an edit or a commit. On Reddit you've also got people who are participating and kind of
creating a comment thread, a coherent one that lets us sort of see, you know, it emerges into a
whole sort of narrative of people interacting and producing this one kind of sort of knowledge
artifact in a way that the whole thread is something that you can read through. But again, they
are sort of modifying the previous person’s thread with the timestamp and so forth. Or even
something like Scratch where you can go through and see people’s sort of
history of the changes they've been making to the system as a whole, if they've been loving, or
favoriting or liking something. And then on a community like Scratch, so in all of these event
logs we have these sorts of activities that people take; they make a commit. They make a
revision and then they make a comment within a system and there's records of that. There's a
case that they are doing this to something. It could be to a repository or a particular code base.
It could be to a Wikipedia article. It could be to a discussion thread on Reddit or something like
that. We've got a record of a performer like who is actually doing this. It could be a user on
GitHub. It could be an editor on Wikipedia. It could be a member of an online community. It
could be a human editor. It could be a bot, an automated sort of entity. And then we also have
an order. We have these things happen in a sequence and this is sort of really crucial and I'll
explain why in a second. You can basically sort these records based on time or
position or depth in a thread, and this time order is actually going to become very
interesting in that if you do sort it you take sort of the Wikipedia history log here or we've got
GitHub or we've got people making commits to a specific sort of Wikipedia article, for example.
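
The four ingredients just listed (an activity, a case, a performer, and an order) can be written down as a small record type. This is only a minimal sketch: the class name `Event`, its field names, the helper `trajectory`, and the sample records are all hypothetical illustrations, not anything taken from the talk's slides or from a real Wikipedia or GitHub API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    activity: str        # what was done, e.g. "revision", "commit", "comment"
    case: str            # the artifact acted on, e.g. an article or a repository
    performer: str       # who did it: a user, an editor, or a bot
    timestamp: datetime  # supplies the order


# Hypothetical records, deliberately listed out of order.
log = [
    Event("revision", "Article X", "B", datetime(2020, 1, 1, 2, 2)),
    Event("revision", "Article X", "A", datetime(2020, 1, 1, 1, 1)),
    Event("revision", "Article Y", "A", datetime(2020, 1, 1, 1, 30)),
    Event("revision", "Article X", "C", datetime(2020, 1, 1, 3, 3)),
]


def trajectory(events, case):
    """Return the time-ordered events that touch a single case."""
    return sorted((e for e in events if e.case == case), key=lambda e: e.timestamp)


ordered = trajectory(log, "Article X")
performers = [e.performer for e in ordered]
```

Sorting by timestamp is one choice of order; a position or a thread depth could be substituted as the sort key.
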
Editor A made a contribution to article X at time 1:01. At time 2:02 we see this sort of event log
sort of emerge. It's sort of a general abstract way that we can look at an event log, but I'll
explain why this kind of temporal ordering is important in a second here. And this is what sort
of the sociotechnical trajectory is at the core of sort of looking at these relationships within
these event logs. And so before I get to that I want to talk about when we've looked at sort
of the scholars who have looked at the sort of sociotechnical systems before, they very much
sort of focused on, you know, these questions of self organization hierarchy that I'm also
interested in of open source systems and so on. So there's a huge sort of literature on this, but
sort of the relationship that they look at here is often this relationship: who works on what
articles? So it's user A, so you've got a node who's a user and another node who's an artifact,
so the user A might have a relationship with an artifact that user A’s making a contribution or
an edit or a commit or something to artifact X like that. So then you say well we can mine that
event log. We can take all these different relationships and we can create some beautiful
picture like this. In the file you've got all these red nodes, which are sort of the articles in
Wikipedia and the blue nodes are the editors who are editing them and you see that oh, there
is all of this complexity and heterogeneity and there's some clustering down here and there's
some other ones that are kind of isolated, but you know, that are all sort of connected. And
you can do all these sorts of really interesting network analyses, but if you wanted to kind of
just zoom in and like look at what did the collaboration look like on one article, you can say
great we can do it across a whole bunch of articles. That's really interesting. But what does it
look like on one article? So if you go back to that sort of the article trajectory or sort of this
individual history of contributions, the way that we do that sort of traditionally is we would say
okay. At time one like editor A made a contribution to article X, and at time two B came along
and made a contribution. At time three C came along and made a contribution. At time four A
came back and made another contribution, so maybe we make that line a little bit thicker, but
it's unclear like the ordering. We can't sort of discern what the ordering is on the system.
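
The traditional "who works on what" reading of the log can be sketched as a weighted edge list between editors and the article. The data below is a hypothetical toy version of the A, B, C, A sequence just described, and the variable names are illustrative. The second half shows the point about ordering: shuffling the contributions produces exactly the same network.

```python
from collections import Counter

# Hypothetical (editor, article) contributions, in the order they happened.
contributions = [("A", "X"), ("B", "X"), ("C", "X"), ("A", "X")]

# Each distinct (editor, article) pair becomes one edge; repeat
# contributions only make the edge heavier ("a little bit thicker").
edge_weights = Counter(contributions)

# The same contributions in a different order give an identical network,
# so the sequence cannot be recovered from this representation.
reordered = [("A", "X"), ("A", "X"), ("C", "X"), ("B", "X")]
same_network = Counter(reordered) == edge_weights
```
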
Maybe D comes and makes another contribution and then maybe A makes the sixth
contribution there at the end and so in the end we see an emergence of a structure but what
comes out of that for looking at just a single artifact, if we're just looking at how a single
Wikipedia article or a single GitHub repo is edited, if we do this sort of method, like we get
networks that look like this, where you got the Wikipedia article or something at the center and
everyone who edited it. And there's no sort of heterogeneity. There's nothing sort of
interesting going on in terms of who is editing what or who's working with whom, right? You
know, it's just this simple kind of story. So instead if we sort of draw on this other set of
literature about narrative networks and sequence analysis and process data and we're looking
at the sequencing of events, right, the interactions of someone modifying it for someone else,
these processes of things unfolding and taking turns and all these sorts of -- there's this rich sort
of literature on sequences and cycles and processes, right? So if we instead look to these things
and ask a different kind of question of who modified whose work. So it's not that like A is
editing article X, but on article X who’s working with whom on that and we in particular look at
the sequence of events that user B modifies user A’s version, that user B is editing immediately
after user A, so that by implication user B is modifying user A’s state of the article. Instead, so
those relationships begin to look like this, that you have this Apteva is modifying Kennvido. He
made a change at a particular point in time. It may not have been editing, you know, this
particular set of phrases, but, you know, when he last touched the article this is what the article
looked like. This guy comes through and makes a quick change like the stuff used to be all caps
and so he's going to go through and say actually let's kind of do it more traditionally, how we
would write a headline like that. So the relationship here is that Apteva’s modifying Kennvido’s
version: Kennvido owned the article for a brief point in time, like that's what the article looked like, and this
editor comes along and modifies that state of the article, modifying what Kennvido had done
before. And if you look at like GitHub you can also see that, like, we've got people removing
some code and adding other types of code like this, and so the previous person sort of
owned this article for however brief a time. Some other person comes along and modifies that
state of the article. If we look at that in sort of abstract sort of trajectory approach the
relationships all of a sudden isn't that A is editing X; it's that B is following A and C is following B
and then A is following C. D is following A and A is following D, right, so this is the sort of
operative sequence relationships that we look at. You also have this repeated sort of -- you've
got the same editors appearing or the same users appearing over and over again so we can sort
of collapse this down to a graph where there's only one node per user, so like A only appears
once, and then B modifies A’s version and C modifies B’s version and then A modifies C’s
version and so we don't have another A that goes back to A. We create this cycle all of a
sudden that the narrative comes out of the fact that we're not sort of modeling each sort of
point in time at each individual edit, but that an editor owns a particular edit and that edit can
come back to you or go away from you, but you sort of are represented in the same way over
time. And so you have a cycle here where people, it came back to A. A just decided to jump
back into the collaboration. He wasn't happy with what C had done or he was happy and
wanted to keep building on it. Then D comes along and wants to modify A’s version and A still
doesn't like that and so what we begin to see out of this sequence all of a sudden is a much sort
of richer and heterogeneous picture of this sort of collaboration,
where it's not the simple star anymore. You got sort of rich, even from this toy example, much
more sort of complex network emerging. And then if we actually sort of applied this sort of
method to an article on Wikipedia, in particular, a breaking news article which is something I'm
very interested in. I've found these collaborations on Wikipedia articles about breaking news
events are different from other types of collaborations on Wikipedia. We look at sort of an
individual article and we use this sort of method, we all of a sudden have a network that looks
like this. Sure, it's a hairball, but it's also a much more complex and structured heterogeneous - I can't pronounce that word today for some reason. All of a sudden we have this network and
this is an article about the 2011 Japanese earthquake and tsunami. The nodes here, again, are
the editors as before and the relationships are if one editor modified a previous editor’s version
of that article. And so the nodes here are sized: how big they are is how many other
editors’ work they've modified, but we can begin to overlay some more information on this.
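
The construction described above (an edge from each editor to the editor whose version they edited immediately after, with repeated editors collapsed into one node) can be sketched in a few lines. This is a minimal illustration using the toy sequence A, B, C, A, D, A from earlier. The function names are hypothetical, and skipping an editor who follows themselves is an assumption of this sketch, not something stated in the talk.

```python
from collections import defaultdict


def modification_edges(sequence):
    """Count (modifier, modified) pairs from consecutive edits in time order."""
    edges = defaultdict(int)
    for previous, current in zip(sequence, sequence[1:]):
        if current != previous:  # assumption: ignore an editor following themselves
            edges[(current, previous)] += 1
    return dict(edges)


def editors_modified(edges):
    """Map each editor to the number of distinct other editors they modified."""
    targets = defaultdict(set)
    for modifier, modified in edges:
        targets[modifier].add(modified)
    return {editor: len(others) for editor, others in targets.items()}


# Toy revision history: performers only, already sorted by time.
sequence = ["A", "B", "C", "A", "D", "A"]
edges = modification_edges(sequence)
sizes = editors_modified(edges)  # used here as the node size
```

Because A appears as a single node, the sequence closes into cycles (A to C to B to A, and A to D to A) instead of a star around the article.
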
We're going to dive a little bit deeper into this Japanese earthquake article. So this is the same
image as before and so I've only added the coloring of the nodes, but this isn't any sort of new
information that we didn't already have in the event log. This isn't sort of new independent
observation. What it is is just that we're going to color the nodes a bluer color if they edited
the article earlier and we're going to color the nodes in a gradient towards sort of redder color,
if they joined the collaboration and they started editing later. It becomes really apparent
all of a sudden that at the very early stages of this article, these blue editors are working together when
the article is very young. They're modifying each other's work in the first sort of hours or
minutes after this event occurs you see this very sort of dense pattern of work where they are
sort of passing the article around to each other and it's this very sort of dense pattern. You
have this sort of this very prominent central person here is a blue editor. He or she joined this
collaboration sort of very early, but then you see sort of these red editors out here and so these
people have a very different sort of pattern, right? This is sort of like a hairball dense clustered
highly sort of cooperative sort of work, but out here you've got these sorts of strings where
things get passed down and so they look like these giant sort of orbits. And this is mainly just
an artifact of the visualization sort of algorithm that I used, a kind of spring embedding. By
implication what it means is that these editors -- this guy sort of makes a revision and this guy is
modifying that guy’s version and this person is modifying that person's revision, but the fact
that we have this sort of string where like they make a revision, but then like no one else, all
these other people who made a revision who have contributed something to this article don't
decide to jump in and modify this person’s version. They leave it to some other newcomer to
come and modify that version. It says something really sort of deep about sort of the way this
collaboration’s unfolding. These people have left the collaboration and instead we have sort of
these new people coming in and modifying it but they can't be bothered to sort of continue in
the collaboration at a later point in time because if they continue to collaborate they move to
the center. They would be working with other people. They would be modifying other editors’
work as well. So we can sort of look at sort of different features of different things here. We
have the first editors here, so these are, again, the first editors who made the earliest
contributions to the articles, but they don't sort of persist. They worked together very early on,
but like as new editors came in and more work was being done, they're not here in the center.
These editors left, right? They edited at a very early stage and didn't continue to sort of persist.
We've got sort of a core group of editors who are editing very intensively with each other. And
this is another group of very early editors. Like I mentioned we've got these sort of peripheral
editors who join much later and are working together in a very different sort of way. Then
you've got this one sort of highly central person who is making contributions. And then you've
also got some other people out here who are sort of highly central but in a different way
because they are surrounded by different colors. Like this blue person is working with other
blue nodes. They are working with other blue editors. They are working with other editors
who joined relatively early. This editor is modifying the work of other editors who joined much
later and, in fact, if you kind of zoom in and look at what he was doing, it's actually not a he or a
she; it is a robot who is just modifying the work of vandals. These are people who joined very
late, so he joined relatively early because of this greenish color, but then the only work he does
is sort of modifying other sort of vandals, so he is reverting the work of other vandals that
people come by and blank the page or say, you know, that tsunami was awesome. They try to
sort of undo that sort of work automatically. But we can also sort of color the nodes a different
way and again, we're not introducing any new information. We know when the editors started
editing. We know when there is sort of a last observation that they edited, and so we can say
when did they sort of stop editing? When was the last observation of when they were editing?
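
Both colorings use only what is already in the event log: each editor's first and last observed edit. A minimal sketch of that bookkeeping follows. The function names, the integer timestamps, and the plain blue-to-red RGB mapping are all assumptions made for illustration; the actual color scale behind the figures in the talk is not specified.

```python
def first_and_last(events):
    """events: iterable of (editor, timestamp). Returns {editor: (first, last)}."""
    spans = {}
    for editor, ts in events:
        first, last = spans.get(editor, (ts, ts))
        spans[editor] = (min(first, ts), max(last, ts))
    return spans


def gradient(value, low, high):
    """Linear position in [0, 1]: 0 is the earliest time, 1 the latest."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


def blue_to_red(t):
    """Hypothetical colormap: t=0 gives pure blue, t=1 gives pure red (RGB)."""
    return (round(255 * t), 0, round(255 * (1 - t)))


# Toy log of (editor, timestamp); timestamps are arbitrary integers.
events = [("A", 0), ("B", 10), ("A", 100), ("C", 90), ("B", 20)]
spans = first_and_last(events)
low = min(ts for _, ts in events)
high = max(ts for _, ts in events)

# Color by when each editor started, then by when each editor stopped.
joined = {e: blue_to_red(gradient(first, low, high)) for e, (first, _) in spans.items()}
left = {e: blue_to_red(gradient(last, low, high)) for e, (_, last) in spans.items()}
```

In this toy data editor A is blue in the first coloring and red in the second, which is the pattern described for editors who join early and keep editing.
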
Again, these bluer colors would suggest that they joined early but they also stopped editing
earlier. This person joined early as before but this greenish color says that he stopped editing
at sort of a medium stage, but you also have these others here who were blue before but are
now kind of red and they're sort of the core editors. These are people who joined very
early and are still editing much later until sort of the present day. We begin to see sort
of the emerging social roles of people becoming custodians of trying to maintain these articles
over time. We can do other sorts of layout algorithms here as well. These attributes, there are
other attributes that we can specify for these networks. Because we're looking at these event
logs and we know how much time has elapsed between edits and that's kind of the operative
relationship that we are looking at, how much time has gone on between these, we begin to
use that as another attribute that we can bring in. So these relationships that editor B modified
editor A’s work, that's editor B edited after editor A, but we can also say that editor B edited
sometime after editor A. We can put on like a latency. We can also go in and say what was the
size of that difference? How much content were they adding or removing? We can even look
at sort of the features of the content. So here's the network as we had it before. If we wanted
to have just a subset of the data and just look at what does this network look like when we just
look at relationships that are happening very intensely, very quickly. People modifying each
other's work in less than 60 seconds, then we again sort of unpack this sort of this backbone of
work largely among the early editors. We also see that this vandal fighter here has a very
distinctive pattern of sort of quickly modifying other people's work. We also see that this very
sort of central person all of a sudden doesn't have that many strong ties anymore but we do
see sort of strong ties out here that these are sort of people that are editing very early on the
article’s history but it's also where the most intense work was going on. People are kind of
trying to work together very quickly and are modifying each other's work, you know, with less
than 60 seconds going by. Then we can also look at this in a different way and say let's just look
at the subset of the nodes, just like remove a bunch of nodes here, and only retain the nodes if
those editors have ever added more than 50 bytes, if they ever
added a sentence worth of content. Again, these are all of the editors that have ever touched
the page and ever made a change to the page, but if we ask the question like who's added more
than like a sentence worth of content, all of a sudden this network becomes much more sparse.
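
The two subsets just described are filters over attributes that the event log already carries: the latency between consecutive edits, and the size of each change. A minimal sketch, assuming each revision is a hypothetical `(editor, timestamp_seconds, bytes_changed)` tuple with negative values for removals; the function names and the sample numbers are illustrative only.

```python
def timed_edges(revisions):
    """revisions: time-ordered (editor, timestamp_seconds, bytes_changed) tuples.

    Yields (modifier, modified, latency_seconds) for consecutive revisions.
    """
    for (prev_editor, prev_ts, _), (editor, ts, _) in zip(revisions, revisions[1:]):
        if editor != prev_editor:  # assumption: skip an editor following themselves
            yield editor, prev_editor, ts - prev_ts


def fast_edges(revisions, max_latency=60):
    """Keep only modifications made less than max_latency seconds later."""
    return [(m, p) for m, p, latency in timed_edges(revisions) if latency < max_latency]


def substantial_editors(revisions, min_bytes=50):
    """Editors who ever added more than min_bytes in a single revision."""
    return {editor for editor, _, delta in revisions if delta > min_bytes}


# Toy history: A writes content, B and D make tiny fixes, C removes a little.
revisions = [("A", 0, 400), ("B", 30, 5), ("C", 500, -12), ("A", 530, 80), ("D", 2000, 3)]
quick = fast_edges(revisions)
writers = substantial_editors(revisions)
```

The 60-second and 50-byte thresholds come from the talk; treating "ever added" as "added in at least one single revision" is an interpretation made for this sketch.
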
So the work that all these people are doing, that these other sort of thousand editors are doing
is incredibly sort of micro sort of level work that they are changing particular sort of commas, or
they're doing stuff that's less than 50 bytes of work, and when you look at just the work of
people who have ever added more than 50 bytes of content to this article, you get a sort of
very different picture and a lot of the very central people that we saw before don't appear
there anymore. We still have two central people here. We're going to explore them in a bit. I
point out those two sort of people at the bottom and so just like I did before is we dove in and
we looked at the event log for a Wikipedia article, or we could do this for any sort of artifact
sort of event log. We can do the same for editors. We can do the same for users. We can say,
we can go in this case, this is my sort of edit history most recently, and so this is the history of
all of the stuff that I've been doing most recently and we can go back and look at all 10,000
edits that I've made to Wikipedia. But we can again, go through and say there is sort of a
relationship where I went from editing the article about John Riedl to editing the article about
the Sawicki talk page and I did that after the social network, so I'm kind of jumping around to
these different topics, but if we kind of aggregate that up and look at sort of the network of all
of these sort of relationships, what does that look like? Again, if I am sort of performer X and
instead of making changes to these different articles we can model it the exact same way
analogously, that I started editing article A. Then I jump to article B, then C and then maybe I
went back to A, and then I went to D and then I went to A. All of a sudden we get networks
that look like this, and so again, these are very unusual; some of these features we
don't usually see in networks. We're used to seeing these sort of core periphery, these complex
sort of hairballs and things like that, but when we look at these editors we see very
different sorts of patterns. We've got this person who was sort of that steward, that very
strong central person up there, so this user’s name was Flodded. So we can see where they
started editing and so the initial edits that he ever made to Wikipedia were about the 2011
Tucson shooting. Again, these nodes are articles. They're not editors anymore. The
relationship is if they edited one article after another, so his earliest edits are to this other sort
of breaking news event. He's editing this article about this breaking news event and then that
event sort of dies down and he starts editing articles about mobile phones and so he's kind of
like off in the wilderness. He's off in the wilderness like literally editing articles about Antarctica
and then the 2011 -- the Tucson shooting happened in January, in the first week; the tsunami
happened in April. This event happens in April, and this person all of a sudden lights back up
and he's really in the thick of editing this article about this earthquake and tsunami. The
articles that he's editing here I've colored in different ways. The blue articles are Wikipedia
articles that we see, so like the front page or the articles that 99 percent of us read and looked
at. The pink articles are the things like the talk page where you have discussions about what we
should do or shouldn't do. We've also got these brown pages which are just sort of more sort
of highly specialized pages about templates and sort of they're sort of infrastructure pages.
Then you also got sort of these turquoise pages which are really sort of the, the pinker ones are
sort of user talk pages, so he's talking to other users about what should be done. And then this
light blue article is actually the talk page for that article. He's jumping around. He's talking to
other users about what they should be doing. He's jumping to the talk page
to try to get other people to work in a certain way. He's jumping to this other article about how
do we sort of like get the number of deaths out and so on. This person is engaged in this sort of
very interesting kind of pattern of behavior that he did some breaking news stuff and then he
didn't and then didn't and then all of a sudden became very active and then did nothing else
after this. Then we can look at other editors too. There were these two other editors that we
saw down here. So one of them is Sandpiper, and we do the same type of method. We look
at his history of what he's done before. We get this network where he started off editing
articles about Harry Potter and then he jumps up and starts editing articles about the nuclear
disaster and the tsunami and then he starts editing articles about Cutty Sark which is a British
ship and also a kind of bad scotch. What's interesting is that the work that he was doing here
on Harry Potter doesn't seem to qualify him at all to edit articles about nuclear disasters or
tsunamis or things like that. Harry Potter on the surface has nothing really to do with a
breaking news event, but at the same time Harry Potter is something that's always in the news,
right? It's like people are always kind of coming by and saying you should put the spoilers in
there. This article is, you know, should have more stuff on Hermione but not as much about
Harry, whatever, so people want to use it as a fan site or something like that. The work that he
does on the Harry Potter site is like being the dispute, sort of arbitration stuff. You've got all of
these light turquoise, blue articles so he's on the talk page saying that guys this is how stuff
should be done, like we should chill out. This is the rules. This is how things should get done
and so he sort of brings that expertise about how do we sort of like, something that's always in
the news and very sort of prominent, he brings that expertise and starts editing that on the sort
of the nuclear disaster and tsunami pages. So he is able to kind of migrate that expertise from
one area of Wikipedia to another and the flexibility and the ability to do that is really sort of
interesting. This sociotechnical trajectory approach is able to sort of illustrate that these things are
highly clustered and very distinct for some editors, but they still make these transitions. Then
you've got this other sort of editor whose trajectory is sort of much more incoherent and sort
of more the hairball that we would see, doesn't have the sort of distinct sort of structure as
before, but we still see that there's some structure here that says that here is the stuff that he
was doing on the Japanese articles and then a lot of his work before was on environmental
organizations, international trade agreements, but again, you sort of see this migration of
expertise where he is very much an expert on nuclear policy and international trade
agreements about nuclear issues and bringing that to the nuclear disaster as well. I mentioned
before that this blue color was very sort of important and this is sort of my favorite case here is
that you've got this one editor here who only edits. He doesn't do any of this kind of
background coordination stuff. He just wants to edit articles that are like, just the content of
articles. That's all he cares about. After the tsunami you have all these towns and cities that
were wiped out and so he's editing all those articles. All the work that he did before was about
like Japan and Japanese pop bands and Japanese serial killers and again, nothing would seem to
qualify an editor who writes about Japanese serial killers and Japanese pop bands to edit an
article about a huge national disaster like this, but for the fact that he obviously is someone
who lives in Japan and has a lot of sort of cultural familiarity with, you know, what are the
towns and what are the landmarks that we need to update about this, and so we see again him
migrating from this early sort of work that he was doing before down to migrating the work on
these affected towns and landmarks. There are ways that we could lay this out again to show
that not only did he go from editing this stuff to specializing in this for a while, but that he
actually went back to doing what he was doing before. I haven't shown that here. So this is
just a single case study about the editors and the single sort of set of articles around this
Japanese earthquake and tsunami. But if we begin to sort of look up at another sort of level we
get a whole class of articles. In this case we will look at a class of articles about airline disasters.
If we do this sort of method that I described before, if we look there are these differences where
there's airline crash articles on Wikipedia. If we look at the article trajectories or artifact
trajectories like which editors are modifying which other editors’ work, we can kind of see by
visual inspection there's two very distinct classes of networks that come out of this. At the top
we have sort of this characteristic, this is what networks often look like, with the core-periphery
structure, sort of like the hairball. We've got kind of hairball networks on top, but on the bottom
these are also networks but we've never seen networks like this in sort of nature or any sort of
social system like this, but these are also networks that we see in Wikipedia and the distinction
between these articles is that these articles on top were authored as breaking news events.
The event happened and within like a few minutes of the event happening people were
jumping in and trying to edit this article. These others were articles about things that happened and years
and years had passed, so that, you know, these airline crashes happened in the mid-1990s, both of
them, so obviously there was no Wikipedia around, so ten years went by before anyone got around
to writing these articles. As a result we see that the structure of these collaborations are very
different, the way that we produce the knowledge about these, they're the same sort of
content. These are both airline crash articles, but the way that this knowledge was produced it
was produced in very different ways and we see those structures very sort of clearly. We can
measure that with a variety of different sorts of like network measures that we can say we
want to sort of measure this hairballness of the networks. We want to say the one on the left is
more of a hairball than the one on the right, so we can operationalize that in at least four
different ways. There's many other ways we could do this as well. These are sort of very easy
to compute and easy to sort of understand methods. We say that diameter is like how
wide is the graph. How many hops would you have to go through to get
from one node to the other at the longest, so the longest shortest path. The one on the left
should have a shorter diameter. These editors can just jump from one to the other. It would
be relatively easy to get from one to the other if you computed that statistic, versus this one where
you have to jump through every single editor. These are extremes, obviously, but they are
illustrative because this one has a very short diameter and the one on the right would have a very
long diameter. Closeness centrality would be a measure of how close nodes are to each other.
Can they sort of reach many other nodes in a few steps? And if we measure that and we
average that across all of the nodes in the graph, you know, in this one on the left the editors
are sort of more close to other editors in general. The one on the right, these editors are
relatively far apart on average from other editors and so the one on the left we expect there to
be more closeness. The one on the right there would be less closeness. With betweenness there is
this sort of interesting idea like brokering, that you kind of connect editors who would otherwise
be unconnected, but here we can think about betweenness not as sort of this classic idea like
brokerage, but instead as like a measure of fragility, that if you like took out that node, like
would the graph suddenly become disconnected in an average sort of sense. The graph on the
left, you know, there's sort of overall low betweenness because there's no sort of single central
node that is like a point of failure, versus the graph on the right where almost every single node is a
point of failure. To get to any one of them, they all have to have very high betweenness
centrality. Then finally clustering, this one has more clustering, more of its editors work with
other editors who themselves have edited with each other. The one on the right no one is editing with
anyone else except the person immediately before them. If we look at sort of we do a
statistical model and we ask are these sort of breaking news articles different in aggregate if we
look across this whole set of articles about airline disasters we do see that depending on how
long it takes for that article to be created, we do see that there are shorter diameters. There is
more closeness. There is less betweenness and more clustering, so across all four features we
find evidence that, in fact, these breaking news articles have more of this hairballness pattern
than non-breaking articles, right? So these breaking news articles are, in fact, structurally
different from non-breaking news articles and that's largely a function of how much time it
takes for those articles to be created. Then we also have this idea, these trajectories also
capture something about the temporality of these collaborations. If we look at some of these
editors joined and start editing at a particular point in time, the collaboration has a
characteristic sort of pattern. In the early stages maybe it's highly clustered and looks
something like this, where again, these are editors who are working with each other at
very early points in time. But if we look at sort of the collaboration much, much later, this
again is the Japanese example that has perhaps a very different sort of structure. But even
though it's the same article, so the way that they are collaborating on these articles may itself
be changing over time. That is as time goes on they may look sort of highly clustered and
hairbally very early on, but later on they may, in fact, revert back to sort of like how
non-breaking articles are edited. So we can, again, sort of estimate a model down here and say that
is it the case that the way that -- the structure of these collaborations at a later point in time if
we just subset that data, do those collaborations later on look very different from those
collaborations sort of very early on. In fact, we do. We see that they sort of regress to the
mean, that they begin as time goes on these breaking news articles, this clusteringness of them
only occurs in the immediate aftermath in days zero through seven, but if we look sort of later
on these articles begin to get longer diameters. They begin to have less closeness, they begin to
have more betweenness and they have less clustering, so here we again find evidence that this
way of collaborating by using the sort of trajectory method we find that this way of
collaborating is very sort of unique to the immediate aftermath of these events that we see
that sort of effect begins to go away as time goes on and they begin to look more and more like
normal collaborations. The differences actually do not persist over time. This last case I want
to talk about here is we've talked about a single case about an article and some of the editors
who contributed to it. Then we looked at sort of this class of like all of these articles about
airline disasters, but if we begin to look at like a set of like 3000 articles about hurricanes, if we
look at articles about airline crashes, if we look at articles about like accidents, industrial disasters and
wildfires and health outbreaks, if we look to really broaden the scope to include all these
different kinds of articles on Wikipedia, we can ask some other sorts of questions. We can
begin to imagine like if we layered these trajectories one on top of another, so before we were
only looking at trajectories in like within single articles, so within the Japanese earthquake
disaster article, for example, right? But what if we said well there's a set of editors who edited
this article. Well maybe some of them also edited articles about airline crashes. We can like
layer these sort of trajectories one on top of each other and like look at this union or the
intersection of these and what does that sort of trajectory look like? And so we can say that on
one article maybe editor A works with and modifies the work of editor B. On article two editor
A modifies the work of editor B. But these relationships on article one and article two both
occur on the breaking news articles, so we can say these are the same editors. They actually
have two different kinds of relationships and we can collapse those two different kinds of
relationships into a single relationship where there's a weight. They've worked together on two
different articles, but on those two articles both of them, or 100 percent of that work was done
on breaking news articles, so we give it kind of like a redder color. So we can color these edges
by different sorts of colors and say maybe the bluer ones are non-breaking articles so they work
together on a breaking news article and a non-breaking article and so they still work together
on two articles. We'll give it a different color because it's a mix of different types of articles
here and so if we again run this across all 3000 articles, we get a nice picture like this. Again,
these kind of gray nodes here are the editors and these are editors who interacted across two
or more articles and so this is sort of the zoomed in giant component of this. The redder colors
that connect some editors to other editors means that they interact a lot on breaking news
articles and if they have a blue color that means they interact exclusively on these non-breaking
articles, right? And then we've got sort of these colors that are light green which means they
sort of interact on a mix of like breaking and non-breaking articles. And so we see there is, in
fact, a strong sort of core of editors here who edit across many different breaking news articles.
They worked exclusively together on breaking news articles given this sort of red color and then
there's other editors who appear to work together repeatedly but on non-breaking news
articles and so on. But this cluster down here is sort of really interesting and illustrative. So
these are editors, right, who are working together. They, we've got records of them modifying
each other's work across so many different kinds of articles, but it's extremely dense. It's
almost as dense as we would see up there. But it's also so removed, that they are not actually
working with those editors or modifying those editors' work; they are modifying each other's
work across many different articles but not necessarily modifying these other editors’ work.
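
The layering and edge colouring described here can be sketched roughly as follows. This is an illustrative sketch and not code from the talk: the function names, the toy article titles and the exact colour cut-offs (all breaking, all non-breaking, anything in between) are assumptions made for the example.

```python
from collections import defaultdict

def layer_trajectories(article_pairs, breaking):
    """Collapse per-article editor ties into one tie per pair of editors.

    article_pairs: {article title: iterable of (editor, editor) ties}
    breaking: set of article titles treated as breaking news
    Returns {frozenset({a, b}): (articles shared, share that were breaking)}.
    """
    flags = defaultdict(list)
    for article, pairs in article_pairs.items():
        # Count each pair at most once per article, ignoring direction.
        for pair in {frozenset(p) for p in pairs}:
            if len(pair) == 2:
                flags[pair].append(article in breaking)
    return {pair: (len(f), sum(f) / len(f)) for pair, f in flags.items()}

def edge_colour(share_breaking):
    """Red = only breaking, blue = only non-breaking, green = a mix."""
    if share_breaking == 1.0:
        return "red"
    if share_breaking == 0.0:
        return "blue"
    return "green"

# Hypothetical toy data: three articles, one of them not breaking news.
article_pairs = {
    "Crash 1": [("A", "B"), ("B", "C")],
    "Crash 2": [("B", "A")],
    "Old storm": [("B", "C"), ("C", "D")],
}
breaking = {"Crash 1", "Crash 2"}

layered = layer_trajectories(article_pairs, breaking)
# Keep only editors who interacted on two or more articles, as in the talk.
repeated = {pair: edge_colour(share)
            for pair, (count, share) in layered.items() if count >= 2}
```

In this toy data editors A and B only meet on breaking articles, so their tie is red, while B and C meet on one of each, so their tie is green; C and D share a single article and drop out of the filtered picture.
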
They kind of just work by themselves and they are relatively isolated, but not completely. But
it's also this interesting color, right? They are not working only on breaking. They are not
working only on non-breaking or there are some who work only on breaking, but there's a lot
here who work on a mix, almost a perfect mix of breaking and non-breaking. So these are
editors who edit stuff about tropical cyclones. So this is like a really interesting sub
community on Wikipedia where Wikipedia has these wiki projects where people sort of get
really excited about editing articles about military history, roads and bridges, and so there's a
group of folks who just like to edit or create articles about every single hurricane or tropical
cyclone that's ever existed. So they write this in a very formalized like standard way that this
storm developed as a squall off the coast of Africa and it moved off and became like a tropical
depression, with this kind of barometric sort of characteristics and so on and so they have just
like this sort of like almost like perfect way of just writing these articles, just copying and
pasting them almost, and they are filling in like what changed for these articles. And so what
we see is that these people are editing. They kind of show up in my breaking news corpus
because obviously there's breaking news articles. These hurricanes are breaking news articles
often but these people are also working together on, because they care about hurricanes. They
don't care about breaking news articles, so they are also editing these other articles about, you
know, hurricanes that happened 50 years ago, but they are working together and so what this
method has really uncovered is by identifying these trajectories overlapping in this way we've
highlighted like a community of practice and this is what we would call a community of practice in
the very sort of traditional way. This is people who are like kind of jointly attending to the same
sort of topic and there's ways to join the community and distinct boundaries so that they are not
sort of part of the rest of the community. They are engaged in a particular type of work that
they care about that is differentiated and so on. And so with this trajectories approach that I sort
of propose and that I think I've tried to demonstrate here, we've gone from this very
abstract idea to sort of these event logs where sort of we can see that if someone edited after
someone else and we've arrived at this place where all of a sudden we can begin to sort of
uncover these sort of really complex and these sort of anomalies in this data by using this
method and so I just want to wrap up here with a discussion. So these event logs are ways
to sort of narrate changes to artifacts, Wikipedia articles [indiscernible] repositories and users
over time. These are the history of someone becoming an identity or an article sort of
becoming more complex or something like that, right? We can go through and look at
previous points in time, at what these articles, what these artifacts, what these users looked like
and were doing at previous points in time. These event logs are pervasive through lots of
sociotechnical systems. We see them in GitHub. We see them in Wikipedia. We see them in
Scratch. We see them in Reddit and so on. We can go through and find these data almost
everywhere and yet we don't sort of realize the potential that we could begin to build
relationships out of these, right? That if we look at these sort of temporal adjacencies, the fact
that I work immediately after Scott who works immediately after Andres, all of a sudden we can
begin to sort of say or do some interesting things about creating these trajectories that have
these really sort of complex and interesting behavior, right? They collapse this sort of very
large-scale data into these sort of nice sort of graphical representation that we can then sort of
do traditional sort of social network analysis things on, right? And they capture this idea of
stigmergy. We've got this accumulation, this gradual accumulation of effort, where individual
contributions create these sort of very emergent sort of complex structures. Again, we're not
sort of looking at sort of the content-level features, that we were working on the same sentence
or anything like that. It's just the fact that I edit after Andres or something like that, but by
aggregating these sorts of interactions all together we could see sort of very interesting and very
complex structures that are nontrivial and they reveal these sort of, the patterns within these
trajectories begin to highlight sort of the social and the temporal context of work, right? That
it's, we see that in the community of practice with the sort of, the hurricane editors. We see that
the sort of more active editors are not sort of deeply involved in -- they've joined the
article very early, right? They had the blue editors. They joined very early and they stopped
editing, right? There are just sort of this interesting sort of behaviors that we capture and this
context of work that people were involved and motivated to work very early but weren't, in
fact, motivated to sort of continue to work going forward. And so I think what's most sort of
powerful here and this is probably like the hardest one to read, but that trajectories also
expose sort of regularities and anomalies for sort of follow-up qualitative analysis, where we see like
sort of the regular pattern how these networks behave, but then we see these anomalies, that
there's people who are like highly central. Like what does that mean? Like those editors are
coming from somewhere. They had some work that they had done before. They had some
work that they were doing afterwards as well, so we only capture that in the single snapshot,
but then we can begin to sort of unpack that in a much more sort of rich way, where we see
these anomalies like this sort of group of editors who were working together but weren't
necessarily working strongly with a lot of other people on all these different breaking news
articles. That's an incentive to kind of dive in there and look deeply at what's going on in there,
and so we are able to unpack from these trajectories these sort of interesting anomalous sort of
structural features that we can then begin to sort of use to then motivate sort of follow-up
qualitative analysis and so I think this approach, the sociotechnical approach that I've discussed
here is an interesting sort of like mixed method sort of thing, right? That we can both use it to
sort of mine, like data mine these extremely large-scale sort of databases that we have that are
pervasive, all these different event logs that we have, but we can use it to sort of collapse
that down to something that is relatively easier to comprehend and make a network out of
that. We can begin to sort of run the network analyses on that and then we can begin to see
that there are the anomalies that then we can use to motivate some interesting qualitative
analysis and sort of tie these things all together and really understand the complexity of these
systems. With that, I'll wrap up and I thank you guys for your attention. I'll take questions.
[applause]
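
The core steps of the talk, turning an event log into a trajectory network by temporal adjacency and then measuring its "hairballness" with diameter, closeness, betweenness and clustering, can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's code: the log is reduced to a time-ordered list of hypothetical editor names, ties are treated as undirected, the graph is assumed to be connected, and all function names are invented for the example.

```python
from collections import defaultdict, deque
from itertools import combinations

def trajectory(revisions):
    """Undirected editor network from a time-ordered list of editor names:
    each editor is tied to whoever made the revision just before theirs."""
    adj = defaultdict(set)
    for prev, curr in zip(revisions, revisions[1:]):
        if prev != curr:
            adj[prev].add(curr)
            adj[curr].add(prev)
    return dict(adj)

def distances(adj, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def diameter(adj):
    """Longest shortest path (assumes a connected graph)."""
    return max(max(distances(adj, n).values()) for n in adj)

def mean_closeness(adj):
    """Average of (n - 1) / (sum of distances to all other nodes)."""
    n = len(adj)
    return sum((n - 1) / sum(distances(adj, v).values()) for v in adj) / n

def mean_clustering(adj):
    """Average share of each node's neighbour pairs that are themselves tied."""
    scores = []
    for nbrs in adj.values():
        pairs = list(combinations(nbrs, 2))
        tied = sum(1 for a, b in pairs if b in adj[a])
        scores.append(tied / len(pairs) if pairs else 0.0)
    return sum(scores) / len(scores)

def mean_betweenness(adj):
    """Average normalised betweenness (Brandes' algorithm, unweighted; n > 2)."""
    total = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        paths = dict.fromkeys(adj, 0)
        paths[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dep = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += paths[v] / paths[w] * (1 + dep[w])
            if w != s:
                total[w] += dep[w]
    n = len(adj)
    return sum(total.values()) / ((n - 1) * (n - 2)) / n

# Hypothetical logs: in the first, every editor ends up adjacent to every
# other editor; in the second, each editor only follows the previous one.
hairball = trajectory(list("ABCDEACEBDA"))
chain = trajectory(list("ABCDE"))

for name, g in [("hairball", hairball), ("chain", chain)]:
    print(name, diameter(g), round(mean_closeness(g), 3),
          round(mean_betweenness(g), 3), round(mean_clustering(g), 3))
```

On these two extreme toy logs the hairball has the shorter diameter, higher closeness, lower betweenness and higher clustering, which is the same direction of difference the talk reports between breaking and non-breaking articles. A graph library would normally replace the hand-written measures; they are spelled out here only to keep the sketch self-contained.
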
>>: I've got a conceptual question.
>> Brian Keegan: Sure.
>>: You are sort of framing this whole entire thing as collaboration between people, but then
the relationships we're actually looking at are proximity in time, so like you might edit an article
that I previously edited and then I might never go back to it. Are we really collaborating on
something? It seems to me that this is more about like participation, so like what distinguishes
breaking from non-breaking articles is the patterns of participation, not necessarily
collaboration.
>> Brian Keegan: Right. So there's probably imprecision in like the language I use and I think
that we can sort of split the hairs here about what sort of coordination, collaboration and all of
these sorts of things, right? But it's important to kind of separate all those things out,
absolutely. One thing I think you brought up though was the fact that, the fact that these
relationships are actually implicit. Like in social network analysis we often assume that like
we make an active choice to e-mail someone or to talk to someone, like that's something agentic,
like I decided to talk to you but not someone else, but in fact, what this method is uncovering
are these like more implicit or like unassumed ties. Like I might be modifying an article not
knowing that it was your version of the article, and so that’s maybe just purely
coincidence. Maybe that's all we're capturing, but if it was sort of really all pure
coincidence then you would just sort of see these random graphs where nothing is sort of
happening, but what we're actually seeing is something very different where we see that
there's these patterns of activity, like where you are engaging and then I'm engaging and the
fact that you don't engage anymore says something about how this collaboration’s working,
that you chose not to engage anymore. You are not sort of paying attention to this thing that,
this isn't something that you care about potentially. It could be that it captures something
about do we pay attention to each other’s sort of work as well. And so I think that in the sort of
very micro level, like a lot of this may just be sort of coincidental sort of ties, right, but I think
that by sort of, there's lots of data here. You begin to mine that up and you would begin to see
sort of that begin to wash away and so there might be like a baseline level of like randomness in
this sort of approach, but we begin to see these sort of structures emerge that suggests there's
something very much non-random and non-coincidental about these ties.
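
One hypothetical way to probe the "pure coincidence" argument is a permutation baseline: shuffle the order of revisions in a log, rebuild the adjacency ties, and compare a simple statistic with the observed one. The sketch below is not something described in the talk; the statistic (number of distinct editor pairs), the toy log and the function names are all assumptions chosen for illustration.

```python
import random

def distinct_ties(revisions):
    """Number of distinct editor pairs that appear back to back in the log."""
    return len({frozenset((a, b))
                for a, b in zip(revisions, revisions[1:]) if a != b})

def shuffled_baseline(revisions, trials=500, seed=0):
    """Mean number of distinct ties when the order of revisions is shuffled."""
    rng = random.Random(seed)
    log = list(revisions)
    total = 0
    for _ in range(trials):
        rng.shuffle(log)
        total += distinct_ties(log)
    return total / trials

# Hypothetical log: two groups of editors who mostly follow one another.
observed_log = list("ABCABCABC" + "DEFDEFDEF")
observed = distinct_ties(observed_log)
baseline = shuffled_baseline(observed_log)
```

If who follows whom were arbitrary, the observed count would sit close to the shuffled average. In this toy log the observed editors keep returning to the same few partners, so the observed count falls below the baseline. Shuffling keeps each editor's number of revisions fixed and only destroys the ordering, which is one of several possible null models.
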
>>: Yeah, so my second question. Have you, you start out very motivated [indiscernible]
sequence analysis kinds of things. Have you done anything like what's the whole set of network
[indiscernible] like [indiscernible] events?
>> Brian Keegan: I haven't done anything like that, so that's kind of like the next approach here.
So having just done this admittedly just purely descriptive sort of work here and saying like you
can create these kinds of networks using this sort of approach, maybe with very complex features,
but all of a sudden we've got networks now that look like networks that we often see in
other sorts of contexts, so now let's begin to sort of bring in these other kind of methodologies
that we have in social network analyses, do some statistical modeling that we can look at
what's the likelihood of some people creating ties with each other and so on. How do these
things sort of evolve and can we create sort of like Markov chains like people making decisions
about when they sort of jump in there and so on, yeah.
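
The Markov chain idea is described here as future work, so the following is only a guess at what a first step could look like: estimating first-order transition probabilities between editors from one editing sequence. The function name and the toy log are hypothetical, and a real model would likely use richer states than editor names.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate P(next editor | current editor) from one editing sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
            for state, nexts in counts.items()}

# Hypothetical log: A and B alternate, then C follows A once at the end.
log = list("ABABAC")
probs = transition_probabilities(log)
```

Here editor A is followed by B two times out of three and by C once, while B is always followed by A. Editor C never has a successor in this log, so it has no outgoing row.
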
>>: I think that perspective could be really interesting because like the sequence of ties is
explicit in that [indiscernible]. That's where you are kind of obscuring the timing of the edges,
not really, because like forward past or forward time, right? They are all kind of in one thing
together versus not the kind of framework where you think about like -- yes, a network but
sudden nodes where edges occur discretely like over time.
>> Brian Keegan: Right. So these networks are sort of unique. I don't have like a graph
theoretical like mathematical background to explain exactly what the right word would be, but
like they are unique because you can actually follow a path on that network which is the editing
sequence. You can like walk this graph and show like how sort of who has touched it when,
right? And so that creates some biases maybe in terms of what kind of ties can or can't exist,
right? But there's, I'm thinking people will want to fight through how to model this. That's
exciting.
>>: That's a whole other thing.
>>: I totally dig the approach of getting down into sort of the action in the network to sort of
bridge the Barabási-Benkler gap. Very cool. And also, sort of apply [indiscernible] and it's
hard to kind of scale that up, right, because you are focused on sort of individual level actions.
It's cool you were able to scale it up to multiple thousands of events. The question then is just
coming back to the, you know, how does it relate to Benkler’s like what's important to society
and so I'm just curious for your thoughts on what do you think are the tightest
operationalizations of this work to this sort of Benkler’s asking what's important for
society? Like is it about we can understand how different networks in terms of this sort of
activity, the editors, enable societal level self reflection, or efficiency or sort of information
spread or, I mean what are the, I don't know. Does that make sense?
>> Brian Keegan: One thing that was motivating and maybe it's a roundabout way of getting
to that was that so Benkler talks about these sort of network forms of knowledge
production, but when we talk about like a single artifact like how that's produced, like we have
no way of kind of unpacking that, so I think that this is one way to begin to see that, in fact,
different artifacts are produced in different sorts of ways and so this complicates the idea that
like we just call it a network. That's all of a sudden like we can just use that as a framework to
understand like these things are all different, that actually within that sort of network form of
knowledge production are actually sort of really interesting differences in how those things are
done. And so then we want to get in sort of the societal level questions and it's like you can
begin to get a question like sort of efficiency, like are there some ways, you know, was the work
that's being done that's highly clustered here really redundant? And maybe it is because we
saw that not a lot of like effort -- there was only like a minority of people who were doing most
of the heavy lifting, were actually making the biggest changes to these articles. In fact, the
work there was sort of a capture of that modularity and granularity that he talks about a lot,
right? But to get to that question of sort of those outcomes that we care about, I mean it might
bear on sort of how would we go about designing systems or designing kind of feedback
mechanisms that sort of reward people for assuming certain kinds of social roles in these sorts
of collaborations to incentivize them to sort of assume roles that make them more sort of
coordinated or whatever interactive [indiscernible] different kinds of editors are highlighting
the fact that they are engaged in a particular way of collaborating and not, or maybe that is
problematic or it's not in that. And so getting back to the sort of Benkler question that, you
know, we usually provide that feedback like monetarily or something like that, but here we
begin to use like mine the information, mine your history to say something about and valorize
that in a certain way that you are engaged in a particular pattern of collaboration that we know
that that leads to these particular outcomes. Maybe I differentiated on breaking and
non-breaking, but maybe some patterns of collaboration lead to very high-quality articles or
low-quality articles and we want to replicate that. Maybe that's just a feature of how we generate
high-quality knowledge artifacts, that we see maybe this hairballing is like why some
collaborations are successful or not because we are part of a community that I know that
Andres is looking at what I'm doing and I'm looking at what you are doing and so we feel like
the work that I am contributing isn't kind of going like off into the void, but like other people
are going to sit down and modify it as well. So in all those sorts of ways I think that we can, this
framework suggests ways that we can both study different ways of like how valuable sort of
these artifacts could be to get that outcome question, that there's differences that maybe these
structures might predict different kinds of outcomes, maybe predict different sort of quality of
these artifacts.
>>: [indiscernible] one of the focuses of his work is to come up with like [indiscernible] patterns
of [indiscernible] systems [indiscernible] statistical. From the specific cases that you looked at
on breaking news, what are the kinds of take-home messages that you would give to, say, if Wikipedia
were to be redesigned to just focus on breaking news? [indiscernible] like what's not
[indiscernible] why are they most important [indiscernible] you pulled from some of the work
that you were doing on breaking news?
>> Brian Keegan: Sure. So I think that the examples of the editors that we looked at there are
the people that are editing articles about Japan or the editors about Harry Potter, I think that
in sort of traditional organizations, the stuff that, you know, Benkler sort of critiques is that
traditional markets or hierarchies would say that you are not allowed to do that. You
shouldn't be able to translate that expertise to someone else, somewhere else. You have to go
through an administrator. You need accreditation or you need someone to say that it's okay for
you to go over and edit these kinds of articles. And so in this network kind of way, we could
begin to see the sort of the flows of the expertise in these new domains that you wouldn't
predict would be it. So what's hard then is like you might be able to sort of create a
recommender system to say that oh, it looks like you like editing talk pages a lot and it seems
that you are a very good sort of discussant or sort of arbitrator or something like that. We are
having problems here. Would you mind coming over here and like helping out with this
discussion thing? And so maybe a design lever around sort of how do we match sort of tasks
to people better by looking at these sort of patterns of how they worked before, the kind of
work that they engaged in before? So we could begin to maybe do some sort of matching like
that. Does that cover that, yeah?
>> Andres Monroy-Hernandez: Any other questions? No. Well, thank you very much.
>> Brian Keegan: Thank you guys. [applause]