Talk Transcript (Word)

>> Lev Nachmansen: Everybody is familiar with his seminal work on treemaps and many other
works that have to do with information visualization and computer -- human computer
interaction. And of course he has a long list of awards that I cannot talk about because if I say all
of this then we wouldn't have time to hear his speech.
However, I have the privilege -- should I introduce it now?
>> Ben Shneiderman: Please.
>> Lev Nachmansen: I have the privilege to announce that he received the IEEE Career Award
on visualization, which will be awarded in the InfoVis next week.
>> Ben Shneiderman: That's right.
[applause]
>> Ben Shneiderman: All right. Thank you for that kind introduction. I'm very, very pleased
to be here. I've for years followed the Graph Drawing Conference, but I don't think I've ever
been. I was invited to be on the program committee. So I'm very pleased that you're seeking the
interactive element as part of this community.
My background is very much in this framework of algorithm design for database and file
strategies and indexing and search, and those kind of algorithms still to me are the heart of
computer science, but I've become what I say is 20 percent of an experimental psychologist in
trying to study the way people use technology.
So I was pleased that at least one of the talks had some empirical results about people, not just
empirical results with data.
And so I'm here today in appreciation and recognition of your work, but also I hope to change
your position a little, your attitude, and maybe shift your attention towards the real opportunities.
Because 20 years ago when graph drawing started, it was a very different world. Networks were
a rare beast. It was hard to get the data. There weren't that many people who were interested.
Now suddenly we're surrounded by social media, and the opportunities and the demand and the
pressure and the interest in visualization overall, the number of blogs and the cultural phenomenon
that visualization has become, is startling.
So I'm very pleased with that. So maybe I'll just plant one idea in your mind; that maybe graph
drawing should rename, still keep a GD, but call it graph discovery. Because the idea is
discovering and making insights. The purpose of drawing graphs is not pictures, it's insight.
And that's what I hope to show and promote.
But first I have great appreciation to the organizers, Morizio [phonetic] and Walter, and a copy
of the book, which I've signed for them, so --
>>: Thank you very much.
>> Ben Shneiderman: Very much appreciate the opportunity, and thank you very much.
So today is meant as kind of a review, and you can look in the paper for a more detailed analysis,
and I'm pleased that Cody Dunne, a Ph.D. student working on these problems, will present the
latest -- last part of the talk about his work.
So also I'm proud to represent the human computer interaction lab, which this year celebrates its
30th anniversary. And I gave up being director to Ben Bederson and then Allison Druin, and Jen
Golbeck's now the director. We are supported and administered by both computer science and
the College of Information Studies and many relations around campus with different
departments, including the wonderfully titled Maryland Institute for Technology and the
Humanities.
So these humanities applications are increasingly interesting ones. In fact, I'm working with a
classics professor who has the social network of Alexander the Great, over 32 years of his life as
he traveled around and 650 connections, and it's a fascinating story, and how do we make a
visual representation of that social network is the kind of challenge that she's asking.
So MITH is the Maryland Institute for Technology and Humanities. If you visit our lab's Web
site, you'll find 650 technical reports, 200 videos, 40 pieces of software, and lots more about our
projects.
I hope you know me from the book Designing the User Interface, which is now in 5th edition.
It's written -- my coauthor for the fifth edition is Catherine Plaisant who's been my collaborator
for 25 years, and the work you'll hear about has partially come from her, but also the wonderful
graduate students that I've had the pleasure to work with over the years.
The story for you here is to recognize that when the 5th edition came out it had a whole new
section on social media. In 2004, when we did the 4th edition, there was no Twitter, there was
no Facebook, YouTube was small, Wikipedia was just starting. And now all of a sudden we're
surrounded by social networks. If you haven't heard, that's the hot story around.
And so visualization has also gained a separate chapter, and so those are the important issues.
With Stu Card and Jock Mackinlay we tried to lay out the basis for this new field. This is a book
from -- that goes back a few years now. And Stu Card gave the title Using Vision to Think; that
visual representations are not just a representation but it's a way of solving problems. And that
was really the significant point; that within 400 milliseconds, if the interface is correctly
designed with color, shape, size, and proximity, then you will be able to spot clusters, gaps,
outliers, and trends in that short amount of time. So there's many implications, and we've
explored that. This book collects 47 papers from different sources and 60,000 words of our own
work.
I think I just can't resist telling you about Spotfire, our early work. The paper was in 1994 at the
CHI conference, remains one of the most cited papers and led to the company formed by Chris
Ahlberg in '97 which grew to 200 people by 2007 and was purchased by Tibco. So it was a great
success story.
Here we're looking at 15,000 births in Washington, D.C. The red dots are girls, the blue dots are
boys. The age of the mother is over here. You can see they go from about 12 to 50. The age of
the father from about 13 to 65. And you get to see many, many patterns, these multiple
coordinated windows. And the dynamic query sliders were the key features of that innovation.
Spotfire has grown to be a place to analyze -- tool for analyzing large complex datasets, and one
of the lessons we've learned that's appropriate for today's talk is that one single visualization is
not the way to show a complex amount of information, but here are 27 windows that are
coordinated and so that if you filter, it filters everywhere. If you select in one window, it
highlights in the other windows. That's the way to deal with complexity in data, not by trying to
pack everything into one screen.
Okay. So the visual world is getting richer and richer, and here you see examples of the kind of
environments people are working in to increase productivity, make better decisions and
understand the world around us.
And, as I said, there's just a rich cultural phenomenon around -- just last night I saw there's a
new -- there's two new blogs and a new conference. The New York Times will have a conference
November 8th and 9th in New York called Visualized, which brings together 25 designers who
are looking at or making creative visualizations.
Control rooms with lots of visual information and collaborative environments are becoming
more and more the way realtime decisions are made. People, this is -- the counterterrorism
center.
Also on small devices we see increasing use of visualization, and that's become another popular
phenomenon. You can even see some treemaps over here to get an idea of what's going on.
So we learn from that. I wrote down one day in a very playful way and called it the information
seeking mantra, and I wrote it in this paper, 12 lines, each one represents one project where we
struggled for weeks or months to find the right design, and it turned out to be show the overview
first, even if it's a million or a billion items, so the users can get an understanding of the range of
data, the clusters, the size of the clusters, the gaps, the outliers, and so on, and then allow the user
to zoom in on what they want, filter out what they don't want, and click for details on demand.
And this has collected almost 2,000 citations, which is kind of amazing, and people who use it,
people who contradict it, people who extend it, people who make jokes about it, so it's gained its
own kind of little phenomenon.
And I think what people like about it is that it asserts this neutrality of human decision-making --
[phone ringing] that's embarrassing -- the neutrality of human decision-making where the user
gets the overview, the user zooms in on what they want and then filters out what they don't want.
So we're not talking about algorithms or data mining. We're talking about a process by which
users make decisions, make discoveries and make insights.
I was very pleased, for example, in March the White House issued its statement about big data
and its expenditure, about $220 million in this country from seven different research agencies,
and I had some influence, I'm pleased to say, but in that three-page press release the word
"visualization" appeared five times. The words "data mining" did not appear at all.
So we're seeing this sort of shift in understanding that visual analytics and visual approaches are
the way people make discoveries and that we support discovery by people aided by rich and
powerful statistical methods; that integration of statistics and visualization is what I really want
to stress with you.
So a little bit of my way of seeing the field. We have the traditional field of scientific
visualization, which has a 50-year history, including geographic information systems and medical and
architecture and so on. These are great success stories, especially if you go to Hollywood
movies or play video games.
But the story that I'm talking about is here. Multivariate data, where Spotfire has been joined by
very effective competitors like Tableau and many other tools, temporal data series, tree
structures, and seems many people know about treemaps, so I'm pleased about that.
And then I save for myself in my work networks for last, because they seem to me the most
difficult aspect of the work; that is, by -- when I think of networks, I think of nodes and edges,
but the nodes may have many attributes and the edges may have many attributes and the
problems we have to ask against those networks are very complex. And so that I felt was a
substantive challenge.
And so I've become more and more devoted to this issue, especially because the social media
have produced such huge resources and such important questions that we need to understand, not
just for entertainment or e-commerce, but also for important national priorities, such as disaster
response, health care, community safety, just so many ways that the benefits -- I noticed outside I
think there's a sign left from yesterday where Chris Dockus [phonetic] was speaking here.
Maybe someone attended his talk. But he's a key figure at Harvard Medical School who has
promoted the notion that if you study the networks of patients, you will find out that patients
become obese if their friends become obese, they lose weight if their friends lose weight, they
stop smoking if their friends stop. And the social networks determine these medical outcomes in
a way that's remarkably powerful. In fact, so powerful that there are many sceptics of Chris
Dockus' work.
We've run this Summer Social Webshop with 50 doctoral students around the country twice now
and been a great success story, and we're just happy to continue that.
And I just want to end the introduction by saying I hope you will think every time you go do
your work that some way you're contributing to these important priorities of not only national but
international, and I like to use this as an illustration, the goal set out by the United Nations in the
year 2000 of ending poverty and hunger, universal education, gender equality, child health,
maternal health, combat HIV/AIDS, environmental sustainability, and global partnership.
In some way I want my discipline and the work I do, and I hope you devote yourself also, to
working in ways that your work gets applied to making the world a better place.
Okay. So we turn now narrowly to focus on networks, and hope some of you know this
wonderful Web site by Manuel Lima called Visual Complexity, ironically called Visual
Complexity. He has a new book out called Visual Complexity that I think you might want to take
a look at, beautifully produced book that shows many network drawings. And he has 772
examples of network systems and pointers to those working tools. And as you can see many
of them are very colorful and very beautiful, but many of them are also a mess, and the usual talk
of hairball or bird's nest or spaghetti is what we see.
So some of them are beautiful and we might admire them, like Hubble Telescope photographs,
and we can say something about the clusters and the size of groups here, but it's pretty hard to
make sense of it. Some of you might want to frame it and put it on the wall, but I'm not sure if
you can make any insights or discoveries in which you would make a decision to change things.
And some more examples, these tangled messes where you cannot see what's going on, there are
some labels, but you don't even know what the labels are connected to, et cetera.
Okay. So one time to continue the mantra idea I made this little phrase of NetViz Nirvana. Our
goal I would say for network visualization is that every node should be visible. I think you all
agree with that and the metrics developed, Peter. I should say it's great to be in the room with
heroes of mine like Peter Eades and Milor Brandis [phonetic] other leaders and Roberto Tomasia
[phonetic] and others. And actually all four authors of the great graph drawing book are in the
room together, which is quite a wonderful thing. And also new younger stars of people who are
working and doing great work in this area.
So, I mean, the idea that every node be visible is pretty common in this community, and there are
metrics for visibility, et cetera, but for every node you can count its degree, for every link you
can follow it from source to destination, and for the cluster you can even see them all and maybe
see their sizes and also spot the outliers.
So I wrote this down in a rather playful way, but it's become a pretty important thing. And like
nirvana, it's never really attainable. Well, not always attainable. But it's something we should
strive for in order to make graphs visible, comprehensible in a way that people can make insights
that they can depend on, that they can make a decision, that they can commit action to.
So here's the outline for the talk. There are four methods I want to talk about. These are all
interactive and dynamic approaches that we have been developing and refining inside the tool
NodeXL. That's the book that I handed out, and you'll see more of that.
And there are basic ideas of filtering; the dynamic filtering queries are alive and well in NodeXL,
double-box sliders by which you can filter out the low edge density or the high edge density or
both, or you can look for the high eigenvector centrality or low eigenvector centrality. All these
different metrics are built in. And then we'll look at clustering, grouping, and motif
simplification. So that's the goals here.
And in a way I see this as the beginnings, the beginnings of a process model. What do I do first?
Well, first I want to filter to look at a simplified graph. Let me try that, see what I can learn from
filtering, then let me try clustering, see what I get from that, grouping, maybe grouping first or
clustering first, and then we'll see about motif simplification.
Okay. So we just start, and we'll just take quick examples of these. There's more examples in
the paper.
So here is a great story that came to us from a practical problem. A journalist named Chris
Wilson working in Washington, D.C., for Slate Magazine wanted to analyze the senate voting
pattern. So there are in the U.S. 100 senators, and he had the data for the year 2007. And what
he was trying to look at is the similarity in voting patterns. Okay. So the strength of each edge
is an indication of how many times they voted the same way on a bill. Okay. So if there are a
hundred senators, how many edges are there?
>>: [inaudible]
>> Ben Shneiderman: No, no. Not N squared. 100 choose two, which is?
[laughter]
>> Ben Shneiderman: I'll wait.
>>: [inaudible]
>> Ben Shneiderman: Let's see. This is -- who is this story? This is --
>>: [inaudible]
>> Ben Shneiderman: 4950. Okay. 4,950 edges. And of course it therefore forms a very dense,
packed area. So that's not going to help. If you see all the edges at once, you really can't see the
patterns. But if you filter out to show 65 percent similarity, you get this lovely network. Okay.
And the blue democrats are here and the red republicans are here, and in the middle we have
three senators: Olympia Snowe, Arlen Specter, and Susan Collins. And this is 2007, so they are
closer to the democratic position, they have a stronger relation. This is Fruchterman-Reingold
layout on top of the filtering.
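The edge count and the filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the agreement scores are made up, the node labels `D1`, `D2`, `R1`, `R2` are placeholders rather than real senators, and `filter_edges` is a hypothetical helper, not NodeXL's implementation.

```python
from itertools import combinations
from math import comb

# A complete graph on 100 senators has "100 choose 2" undirected edges.
print(comb(100, 2))  # 4950


def filter_edges(similarity, threshold):
    """Keep only the pairs whose voting agreement is at or above the threshold.

    similarity maps an undirected pair (a, b) to an agreement score in [0, 1].
    """
    return {pair: s for pair, s in similarity.items() if s >= threshold}


# Hypothetical agreement scores for four placeholder senators (not real data).
similarity = {
    ("D1", "D2"): 0.92,
    ("R1", "R2"): 0.88,
    ("D1", "R1"): 0.41,
    ("D2", "R2"): 0.38,
    ("D1", "R2"): 0.67,
    ("D2", "R1"): 0.52,
}

# Four nodes give "4 choose 2" = 6 possible pairs, all present here.
assert len(similarity) == len(list(combinations(["D1", "D2", "R1", "R2"], 2)))

# Show only the edges with at least 65 percent agreement.
kept = filter_edges(similarity, 0.65)
print(sorted(kept))
```

Raising or lowering the threshold plays the role of the slider: a layout algorithm would then be run only on the edges that remain.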
And so it really showed a dramatic result. And it was remarkably predictive because two years
later in February 2009 three republicans crossed over to vote for the Obama Stimulus Bill, and it
was exactly these three senators. So you can see quite a lot. And if you look carefully, you'll
find that the more liberal, progressive democrats are over here, the more conservative
republicans are over here.
So we did a lot just by filtering, but it's still not perfect and you'll see how we do even better to
work on this graph.
And this is -- shows you the first example of NodeXL. It's embedded in Excel as a template, so its
strength is that it's easy to use, it's free, embedded in Excel, and its disadvantage: it's embedded
in Excel, which has many limitations, and so it has these problems. I mean, the benefits are --
we believe that we are -- with the book are trying to promote the democratization of social
network analysis, to allow many more people to do it. It does not require programming, does not
require advanced work, and in a couple of weeks in a sociology and a political science, or
computer science, in my class I have a three-week section, and my students do very ambitious
projects inside NodeXL. So you might want to try it, free to download. We've had 125,000
downloads, so you can join that crowd and take a look at it.
Okay. So that was simple and, you know, many other examples of filtering. It's an easy concept
and it's just easily done with a slider inside NodeXL and you can filter by any of the metrics of
the nodes or the edges and -- so we're going to look more at clustering. Clustering I define here.
There are many terms: aggregation, clustering, grouping we'll see, simplification,
summarization, meta nodes; we still have to get our language organized.
Graph theory has been marked over the years by a failure of consistency in terminology, with the
battling language of different groups that makes it hard to speak. But we'll see if I can convince
you that.
So here is a graph from somebody else. This is the network of people -- of actors in the play Les
Misérables. And there's an edge between any pair of actors if they appear in the same scene.
Okay?
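The edge rule just described (two characters are linked if they share a scene) can be sketched as follows. The scene lists here are invented for illustration and are not the actual data behind the graph; `cooccurrence_edges` is a hypothetical helper name. Counting shared scenes gives each edge a weight that could drive edge thickness.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(scenes):
    """Count, for every pair of characters, how many scenes they share."""
    weights = Counter()
    for cast in scenes:
        # sorted(set(...)) removes duplicates and gives each undirected
        # pair one canonical ordering.
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical scene lists, for illustration only.
scenes = [
    ["Valjean", "Javert"],
    ["Valjean", "Fantine", "Javert"],
    ["Valjean", "Fantine"],
]

edges = cooccurrence_edges(scenes)
for (a, b), w in sorted(edges.items()):
    print(a, b, w)
```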
Now, you can't make too much of that in here. And even if you look at it that way, there's not
too much that leaps out. So that's really a bad example and it doesn't satisfy the NetViz Nirvana
principle: all the nodes are occluding one another, the edges are impossible to follow, you
really can't see the structure or the clusters.
So inside NodeXL -- this is one of the examples from the book -- we do a little bit more of color
coding by clustering. We have three different clustering algorithms built into NodeXL. If you
want to add a fourth one, please help us by extending NodeXL and adding the code for yet
another one.
So we can see immediately the main character, Jean Valjean, who appears in many scenes, and
then there are various, like Fantine, some of you may know the play, Javert, are key players, and
then there are some groups that appear only in one scene. They form a clique over here.
And so we have other -- another small clique. They appear only once and they appear in the
same scene and that's the end of it. So you can understand they're not very important, although
usually you think of cliques as important, but here they're relatively low importance.
And then you can see other clusters of characters who have strong ties and connections with each
other, and the thickness of the edges shows the strength of the connections between them.
So the clustering here illustrated by coloring gives you some help in understanding and making
sense of it, and, again, size, color coding and so on. And then in here we label only the key
players so we don't clutter the screen with other information.
This was another wonderful success story. We also -- NodeXL has importers for networks from
Flickr, YouTube, Facebook, and other sources, GraphML and many other formats. You can just
type it in or cut and paste it. If you have an edge list, you can just cut and paste it.
This was just another -- this was a remarkably good application of clustering where we took all
the photos in Flickr that had the word "mouse" in them, and then by the co-occurrences of other
terms, we created the linkages, and the clustering algorithm did a perfect job. And so you see in
natural language processing this is called word sense disambiguation. And so here the yellow
cluster is exactly the computer mouse, the blue cluster is the animal mouse, and the red cluster is
Mickey Mouse. And so it just turned out to be a really nice example where clustering worked
very sweetly. It's not always so fortunate that clustering works out so well, but here was a good
example.
We also began to study and see the patterns of clustering in popular data. This is the Twitter
stream of all the tweets at a certain point that had the hashtag "GOP." In the U.S. GOP stands
for Grand Old Party, which is a short form for the republican party.
So these are all the people who used hashtag GOP in a tweet, and the clustering showed a large
cluster, which we did color red for the republicans, the traditional color, and a smaller cluster of
blue for the democrats. The red cluster is much more dense. There's many more of them.
They're thickly connected. And there's a high density there. And there's a fewer number and a
less dense connection over here.
And you can see the bridge between them is relatively mild except for one large node. These are
betweenness-centrality-sized nodes, and that one node is the political Web site called Politico
which both democrats and republicans will read. And so we got to see quite a lot.
There's another cluster of green which are kind of independents. They're floating around over
here, and some other smaller clusters around there, but they didn't quite show up there. But I
think you get the idea. This is a traditional example of conflict in social networks, where there's
two tightly woven groups that are quite independent and the bridge between them is relatively
low.
Okay. Any questions about this? You got the idea? All right. So this was actually for
Microsoft TechFest on the campus here in '11. We were beginning to develop these techniques.
This is still Fruchterman-Reingold. But with clustering you can see the colored clusters, not very
effective, and this is what motivated us to try to do better.
On the bottom the singletons who are not connected -- I should say the way the network is
formed is Twitter lets you download not just, you know, if you search on these terms you'll get
all the tweets, but then you'll get the person who did the tweet and you get their follower
network. So you build a network out of the followers. Okay? So very powerful.
These people were not connected. About 20 percent of them were just independent. And then
you had one main large cluster here, but not well differentiated. And, as you can guess, the
clusters are not strongly identified, but we developed the technique called group in a box by
which we put the clusters in, believe it or not, a treemap.
So here are the singletons. That's the biggest group, actually. And then this main cluster -- this
is Microsoft and its publicity mechanisms, and so all the people around here. These were
Microsoft Research employees. And then we had a certain group -- I forgot which group this
was, but we had a Brazilian group in here, which surprised us. And then the smaller groups get
laid out over here.
So I would advocate rather than trying to draw one graph that you draw these multiple boxes. In
these cases you can say the clustering is not so solid because there are lots of links between some
of these clusters. So they're not tight or well-formed clusters. In NodeXL we let you delete the
edges or bundle them if you wish -- I'll show you that soon -- between the clusters so as to clarify
what's going on in the cluster.
My playful motivation for this when I gave the first talk about this was I went out and bought
some grapes and I sort of asked my audience, and I can ask you, how many grapes are in this
picture? Any guesses?
>>: [inaudible]
>> Ben Shneiderman: A hundred? 200? 300? Well, that's pretty good. There are 149 grapes
here. But it's hard to tell, hard to count the grapes.
How many clusters are there? It's hard to see. But if I tear them apart and then lay them out on
the table, as I did, you can count them. And there are -- you can count -- pretty close, there's still
a couple of obscured ones, occluded ones, but pretty close to counting the 149, and you can see
the nine clusters that were there in the grapes.
So it's sort of motivating the idea. And this was a conference I attended in April at MIT called
Collective Intelligence, and this one turned out to be very nice and a good demonstration of our
techniques.
And so this main cluster was a group of academics that includes me over here. I've become quite
active in Twitter. Actually, how many people have Facebook accounts? About a third. How
many people have Twitter accounts? Whoa, only about six or eight. That's pretty interesting and
pretty typical. Computer science people are not quite into the social network things as much.
And I have to tell you, when I speak to business or sociology or other student groups, it's 95
percent. And this conference does not even -- does it have a hashtag? I searched on GD 2012. I
found three tweets including one that I had of announcing the conference. There were two
others. But there's just not a tweet stream coming from this audience, which just reminds you
about the sort of different kinds of people and communities there are.
But conferences like the Webshop we ran generated thousands of tweets in a 24-hour period. So
people are quite active, and understanding those patterns is of course an important social
question, business question, but also important national security and health and other
applications, which is why there's such a strong interest in studying these Twitter patterns.
In any case, the largest group was over here, and there were other people you may recognize. I
guess Elizabeth Churchill from Yahoo! I can't even see it on this screen from here. Let's see.
Sean Munson, now University of Washington, Mike Bernstein from MIT. There were a whole
bunch of those academics.
And then we had another group of -- this was I guess the French group, this Brazilian group was
here, this woman was in the room and was very excited to see that she was quite central to the
discussion there. And this German fellow over here turned out to be an important component
and he had his own little community.
So being able to see these communities. And we -- in this case I used the technique called
combined edges so that the edges across the clusters were combined into the single light gray
edge so you could get an idea of the relative connectedness among the communities.
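The idea of combined edges can be sketched roughly as follows: every edge that crosses between two clusters is folded into a single weighted edge for that pair of clusters. This is a minimal sketch, assuming a node-to-cluster mapping is already available; the function name `combine_edges` and the toy data are hypothetical, and this is not NodeXL's own code.

```python
from collections import Counter


def combine_edges(edges, cluster_of):
    """Collapse edges that cross clusters into one weighted edge per cluster pair."""
    combined = Counter()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:  # edges inside a cluster are left as they are
            combined[tuple(sorted((cu, cv)))] += 1
    return combined


# Toy data: five nodes in three clusters (purely illustrative).
cluster_of = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "C"}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

meta_edges = combine_edges(edges, cluster_of)
print(dict(meta_edges))
```

The resulting counts could then set the width or shade of the single edge drawn between two cluster boxes.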
We also do edge bundling and curly edges and we can have tight or loose edge bundling. And so
here was another conference that was run at Maryland called Theorizing the Web, and the main
cluster had links to quite a few of the other clusters.
But I'll show you others where you see distinct differences among the clusters. For example, you
see no links between these clusters, or very few between these other clusters, but this main
cluster had many links to the others.
This gives you an example of the power of this. We were approached, Cody and I, and
especially Cody worked with Scott Dempwolf, an analyst who was working on innovation
patterns in the state of Pennsylvania. This is 11,000 nodes and 26,000 edges. It looked quite
beautiful. Done in NodeXL. But it doesn't really tell you too much about what's going on there.
And so we need to -- hope you can see it.
But there we broke out the clusters. The main cluster turned out to be two key individuals who
each had about a hundred patents, and they were the main drivers of innovation and economic
development in the state of Pennsylvania.
Secondly, Westinghouse Electric in Pittsburgh was a great source of patents and other innovative
work, and then we had the unfortunate problem that the two suburbs of Philadelphia were
diagonally across, and there's some wispy lines going across them there, and so you get quite a
lot. So that was clustering. But if I apply filtering, I can make a simpler story and get rid of the
edges, and now I can much more clearly see who are the key groups, who are the key influencers
in each group, and if I want to get out there and try to promote innovation, that might be a good
way, so I filter down even more to just having about a dozen groups, I now can focus my
attention and know what's going on in this dataset.
I asked Scott for an analysis of Maryland innovation, and so he gave -- he favored me by doing
that one, and this is our lab, Human Computer Interaction Lab. These are NSF grants and
copartnerships, so there's quite a few NSF grants in our group. Catherine Plaisant's there, I'm
over here, Jen Golbeck, the current director, Ben Bederson, Allison Druin, I hope you know
some of these names, and the partnerships we have with other groups around the state of
Maryland and their own clusters of Johns Hopkins or in Baltimore or other places.
So it was a kind of confirming sense that these analysis tools were giving us insight to what's
going on. This is also kind of pretty, but it is closer to NetViz Nirvana because you can read
almost all the names here. This was done automatically. And sometimes we clean these up by
hand to make it a little better and clean up some of the occlusions. But I think you get the basic
idea of how to do this.
And this group in a box strategy I would like to recommend to you. I hope some of you will try
it. And Cody, for his dissertation, will continue to work on what we call meta layouts, other
layouts which have other properties that are effective where the clusters sit inside one region and
then you can see the connectedness to other regions.
A question? Thank you. Roberto.
>>: I was wondering if clusters are placed inside the [inaudible] of the treemap and how do you
decide what is the classification, what is the underlying tree, so how do you decide --
>> Ben Shneiderman: It's not a tree. I mean, it's a clustering. We use the Newman-Girvan --
actually Michelle Girvan is a member of our faculty in physics, so that's the one I think was used
in this case. Do you recall? But we've got [inaudible] and we've got three different clustering
algorithms in there.
So you cluster. It creates clusters. And it's not a tree structure, but they're partitioned. And
then they're laid out just by the size of the cluster here. The biggest cluster goes here and the
smallest one goes over here. Okay. So very straightforward. It's not optimal.
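The placement by cluster size can be illustrated with a deliberately simplified layout: sort the clusters from biggest to smallest and give each a box whose area is proportional to its size. The sketch below lays the boxes out in a single row, which is an assumption made for brevity and not the treemap layout NodeXL actually uses; `layout_boxes` and the cluster names are hypothetical.

```python
def layout_boxes(cluster_sizes, width, height):
    """Assign each cluster a box whose area is proportional to its size.

    Simplified single-row layout: biggest cluster first, left to right.
    Returns a dict of name -> (left, top, box_width, box_height).
    """
    total = sum(cluster_sizes.values())
    boxes = {}
    x = 0.0
    for name, size in sorted(cluster_sizes.items(), key=lambda kv: -kv[1]):
        w = width * size / total
        boxes[name] = (x, 0.0, w, height)
        x += w
    return boxes


# Illustrative cluster sizes (node counts), not taken from any real dataset.
boxes = layout_boxes({"main": 30, "singletons": 50, "small": 20}, 100.0, 40.0)
for name, box in boxes.items():
    print(name, box)
```

Each cluster's nodes would then be laid out inside its own box, with the between-box edges deleted, combined, or bundled.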
And do I give away the doughnut? Cody, don't do this. It's Cody. But the idea is put the big
cluster in the middle and paint the other ones around it, and then you'll be able to see the
connections more easily. And Cody's got three other ideas that will be I think nice
improvements in ways of dealing with clusters and relationships among clusters.
But I think also managing the edges, either deleting them, combining them, okay, or showing
them, are powerful ways. Because you want to control. You're essentially filtering those edges.
You want to control the visibility so as to achieve NetViz Nirvana to be able to read what
remains and to understand what's going on.
And selectively, as a sequence, not just one picture that solves a problem, but a process by which
you interact and you successively explore hypotheses or seek out attributes that you believe are
important. You go in -- if people come to me with a network and they say show me what's there,
I say you're not ready. And I say what's your question? If you don't have a question, you're not
ready to work. Okay. You have to have a question. You have to have at least one. I mean, we
train our potential users to have questions, and that's where it starts.
Now, you know, we tried to have a systematic-yet-flexible, SYF, systematic-yet-flexible, process
to explore. So we try to go in order so we would accomplish the systematic approach. But when
something interesting pops up, you want to be flexible to go exploring. So it's not an automated
process. And domain experts have a huge amount of knowledge by which they will spot things
that we can't spot. Okay. Anything else? Thank you, Roberto. Yes, Christian.
>>: [inaudible] many of the networks that you have shown so far are actually two-mode
networks, like this one as well where the edges are defined by some other type of entity and the
co-adjacency towards these entities. Have you built any representations with these [inaudible]?
>> Ben Shneiderman: Let's see. If I remember, this is not a bipartite. The nodes are all
principal investigators on NSF grants. And so it's --
>>: NSF grants would be the other node?
>> Ben Shneiderman: The grants are not shown here. But, yes, we could.
So, yeah, bipartite graphs is a very tempting thing. We've done a few things, nothing brilliant.
And I think that's another good topic that we'd love to work on. Bipartite especially is tempting.
But I haven't found a brilliant solution for that except just lining them up and showing the
connections there, if you have a good idea.
Spotfire does include networks and it just has two regions and it will randomly jitter them around
in two regions and then show the connections that way. Not a much better solution, but it allows
for more than just a one-linear flow.
So I think bipartite, tripartite are other good problems to work on. There are many, many. We
have about 150 items on our NodeXL to-do list of things that we want to do. And then of course
conversion from a bipartite to a single mode graph would be another natural thing we've talked
about.
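The bipartite-to-single-mode conversion mentioned here is easy to sketch. The following is a minimal illustration, assuming the two-mode data arrives as (person, item) pairs; the function name `project_to_single_mode` and the sample data are hypothetical and are not taken from NodeXL.

```python
from collections import defaultdict
from itertools import combinations


def project_to_single_mode(memberships):
    """Project a two-mode network onto one of its modes.

    memberships: iterable of (person, item) pairs, e.g. (investigator, grant).
    Returns a dict mapping (person_a, person_b) to the number of items the
    two share, with person_a < person_b so each pair appears once.
    """
    by_item = defaultdict(set)
    for person, item in memberships:
        by_item[item].add(person)

    weights = defaultdict(int)
    for people in by_item.values():
        # Every pair of people attached to the same item gets one more tie.
        for a, b in combinations(sorted(people), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical investigator/grant pairs.
pairs = [("ann", "g1"), ("bob", "g1"), ("ann", "g2"), ("bob", "g2"), ("cy", "g2")]
print(project_to_single_mode(pairs))
```

The shared-item count becomes the edge weight, which is one common choice; other weightings are possible.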
This was just pretty. This was analysis of community discussion groups. They were all very
independent. They were just simply a posting and then discussions that followed the posting.
But it just shows you the other way. You can go, and it's kind of pretty. I thought we'd make a
T-shirt out of this one or something.
Okay. So let me go on to grouping. And grouping is a very simple idea. Instead of clustering
where you cluster by edge, grouping you group by node attributes. For me the idea of a network
being points that are just like physics points in space are not very interesting. I really look for
data where the nodes have many attributes so you can do something interesting about them.
So the classic one was -- here was another version of the senate voting patterns. And if you
break them out by the attribute of which region of the country they come from, then you use
group in a box, you get this very nice structure, which shows the Southern senators have the high
degree, the republicans are very tightly woven together, the democrats less so, and then we see
other regions we have.
You can see immediately the relative number of democrats and republicans, and you can see that
like in the Pacific region that kind of -- that separation does not occur nearly as strongly. And
sometimes there are overlaps as well.
So you get a lot more by tearing apart the graph and showing parts of it at a time, and that is a
big win if you have categorical attributes for the nodes.
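Grouping by a categorical node attribute, as opposed to clustering by edges, can be sketched as follows. This is only an illustration: the function name `group_by_attribute` and the sample senators, regions, and edges are made up.

```python
from collections import defaultdict


def group_by_attribute(node_attr, edges):
    """Split a graph by a categorical node attribute.

    node_attr: dict mapping node -> attribute value (e.g. region).
    edges: iterable of (u, v) pairs.
    Returns (groups, within, between): nodes per group, the number of edges
    inside each group, and the number of edges between each pair of groups.
    """
    groups = defaultdict(list)
    for node, value in node_attr.items():
        groups[value].append(node)

    within = defaultdict(int)
    between = defaultdict(int)
    for u, v in edges:
        gu, gv = node_attr[u], node_attr[v]
        if gu == gv:
            within[gu] += 1
        else:
            between[tuple(sorted((gu, gv)))] += 1
    return dict(groups), dict(within), dict(between)


# Hypothetical data: four nodes in two regions.
attrs = {"s1": "South", "s2": "South", "p1": "Pacific", "p2": "Pacific"}
links = [("s1", "s2"), ("p1", "p2"), ("s1", "p1")]
print(group_by_attribute(attrs, links))
```

Each group would get its own box, and the between-group counts indicate how many edges have to cross box boundaries.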
We can also -- I mean, in NodeXL, you can replace multiple nodes that are the same attributes
with a single group node, a node that has a big plus sign in the middle of it, so you can simplify
the graphs by grouping. Which takes me immediately to Cody and the idea of let's find another
way to simplify the graph like common motifs.
>> Cody Dunne: So we can take these nodes and combine them into a meta node, but when you
do that you don't know really what -- sorry -- where it came from, you don't know anything about
the underlying topology, you don't know anything about the attributes.
But my idea with motif simplification is to take specific repeating patterns that take up a whole
lot of the screen space and replace it with representative glyphs that tell you what's inside them.
So, for example, we have a fan motif. It's all these singly connected nodes that are connected to
only one head node and then to the rest of the network.
And the idea behind the glyphs is to replace these fan nodes with a fan-shaped glyph, you know,
the arc is sized according to how many nodes it's replacing, so a large glyph will replace a lot of
nodes, a small glyph will replace a small amount of nodes, and then if you have a color scale on
it, in this case going from orange to purple, let's take all those attributes, so let's apply a function
to it, like the mean, and let's put that on the same color scale and color the glyph using that color.
That way we can show some information, anyway, about the attributes, we can show how many
nodes were inside it, and then the topology that it's replacing.
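A fan, as described above, is a set of degree-1 nodes hanging off one head node. A minimal sketch of finding fans and computing the two quantities the glyph encodes (leaf count and an aggregate attribute) might look like this. The function names, the `min_leaves` cutoff, and the sample data are assumptions for illustration, not the published algorithm.

```python
from collections import defaultdict


def find_fans(edges, min_leaves=2):
    """Return {head: [leaf, ...]} for heads with at least min_leaves leaves.

    A leaf is a node with exactly one neighbour; that neighbour is its head.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    fans = defaultdict(list)
    for node, nbrs in adj.items():
        if len(nbrs) == 1:
            (head,) = nbrs
            fans[head].append(node)
    return {h: sorted(ls) for h, ls in fans.items() if len(ls) >= min_leaves}


def summarize_fan(leaves, attr):
    """Glyph inputs: arc size from the leaf count, colour from the mean attribute."""
    return {"size": len(leaves), "mean": sum(attr[n] for n in leaves) / len(leaves)}


# Hypothetical graph: head "h" has three leaves and sits on a triangle h-x-y.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "x"), ("x", "y"), ("y", "h")]
fans = find_fans(edges)
print(fans, summarize_fan(fans["h"], {"a": 1.0, "b": 2.0, "c": 3.0}))
```

The mean here stands in for whatever aggregate function is applied before mapping onto the colour scale.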
Similarly, we can look at a connector motif. This is where you have these functionally equivalent
span nodes in the middle that are doing nothing except connecting two or more other nodes
together. And we can replace these with the exact same visual representation as you would get if
you drew it nicely in a graph, this tapered diamond shape.
Again, we could do some sizing based on how many nodes we're replacing, we can have meta
edges on each side that are sized or colored depending on the edges that they replace. And,
again, we have some nice coloring for those things.
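One way to sketch connector detection is to group nodes by their exact neighbour set: nodes that share the same two or more neighbours and have no other edges are interchangeable connectors. The function name `find_connectors` and its parameters are hypothetical, and this is an illustration rather than the method used in the tool.

```python
from collections import defaultdict


def find_connectors(edges, min_span=2, min_members=2):
    """Return {anchor_set: [connector, ...]} for groups of equivalent connectors.

    Nodes are equivalent when their neighbour sets are identical. In a simple
    graph (no self-loops) such nodes cannot be adjacent to each other.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    groups = defaultdict(list)
    for node, nbrs in adj.items():
        if len(nbrs) >= min_span:
            groups[frozenset(nbrs)].append(node)
    return {
        anchors: sorted(members)
        for anchors, members in groups.items()
        if len(members) >= min_members
    }


# Hypothetical graph: "a" and "b" both do nothing but link "p" and "q".
edges = [("p", "a"), ("a", "q"), ("p", "b"), ("b", "q"), ("p", "z"), ("q", "w")]
print(find_connectors(edges))
```

Note that in a complete bipartite subgraph both sides qualify as connectors for each other, so a real implementation would need a rule for choosing one side.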
Let's look at an example here. Here we have a bipartite network. It's wiki edit, so there's four
wiki pages here from the Lostpedia wiki. So they have their main discussions and they have
their theory of the lost universe discussions and they're kind of separate. So we have those four
wiki pages, and then we have all the little circular editors editing those pages. You can see
there's a lot of editors in those big fans that had only one page, and there's a fair number in the
middle that edit two pages. And then in the very center of the drawing there's some that edit two
or three or four pages and so on.
If we take these motifs and replace them with glyphs, we get that drawing there on the right. So
that really big fan down in the bottom is replaced with that large arc fan. You can see that it has
a very purple attribute value, whatever purple means in this network, and then we can see the
cycles of the fans throughout it.
And then we can also do the pairwise connections and see that main discussion and main have a
whole lot more editors editing both of those than main and theory on the far right. We went from
about 512 nodes down to 25. And so now it's easier on you, it's easier on seeing labels at a
distance, and it's easier on your layout algorithms, although this is going to be a denser network
after you get rid of all that peripheral stuff.
So we can also look at cliques. Cliques are very interesting parts of a network. And finding a
maximal clique is a hard task for a user to do just by looking at it. We can take these cliques, so
like a four clique, five clique, six clique, replace them with glyphs, again size depending how
many member nodes there are in the clique. And when we look at this senate example that we
just saw, 65 percent agreement, this entire network gets simplified down into three cliques and
one little individual node there.
Okay. We have 51 nodes in that top right democrat clique that's all the democrats, it's two
independents, and it's Olympia Snowe who's actually in that. So it's a little bit off-blue instead of
just being pure blue. You can't really see that, though.
We have on the bottom left 38 republicans. And then in the right we have four moderate
republicans. We have McCain -- let's see who else is in there -- that's Collins and Smith and
Specter as well. These are the moderate bridge builders. And we can see based on that meta
edge that they're tightly connected with the rest of the republican party, but there are also a fair
number of connections to the democrats.
And then down there in the bottom we have Coburn. He's a very staunch republican when it
comes to very contentious issues, but he votes with his heart, so he's a little bit of a wildcard and
he just kind of pops out right there.
So let's see what happens when we go to a higher threshold of agreement. We were at 65
percent. Let's move on to 70 percent. So in the node link visualization, when we laid it out
again, I think it was with Harel-Koren when I did that, it spreads out a little bit more.
You see a few more of the edges disappear, and you see a little bit more information happening.
You see Olympia Snowe come out of the democrat clique. She's got that really tight connection
there still. We still see Specter and Collins in the middle, and then over on the left side we see a
bunch of wildcard republicans come out.
And I was talking to my brother, he's in political science, he says yes, these are the people who
you just don't know what's going to happen with them. We've got Voinovich and Vitter and
Hagel and that Coburn that we saw before. And that small clique, it's still got McCain in it, but
the members have actually changed as the edges have been deleted. Some of them came out of
the main clique into his little moderate clique, and others went out into the rest of the network.
Moving on, 80 percent agreement, we see the network it gets bisected, we saw this really tight,
dense connection on the democrat side, but the republican party cohesion is starting to break
down in 2007 anyway.
Up in the very top right we have the extreme liberal democrats. These are the East Coasters.
Let's see. Who's that? That's Lieberman, Feingold, Kennedy, and Biden. Feingold, anyway,
used to be an East Coaster. And then the small clique just below that are the really moderate
republicans. That's actually called the Blue Dog Coalition. And right there down there in the
bottom, Nelson sticking out, he is the epitome of Blue Dog democrats. And for those of you
who don't know, the Blue Dogs are the extreme moderates of the democrat party.
So, as we continue on, 85 percent, we see a little bit more breaking down, 90 percent, we start
seeing a whole lot of segmentation here. 95 percent, we still have a democrat clique, but then we
have Isakson and Chambliss. Those are the only republicans left. And those are from the same
state, Georgia.
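The threshold sweep above (65, 70, 80 percent and so on) amounts to keeping an edge only when two senators vote the same way on at least that fraction of roll calls. A minimal sketch, assuming each senator's record is a sequence of votes in the same roll-call order; the function name `agreement_edges` and the three toy records are invented for illustration.

```python
from itertools import combinations


def agreement_edges(votes, threshold):
    """Return the pairs whose fraction of identical votes is >= threshold.

    votes: dict mapping a name to a sequence of votes, aligned by roll call.
    """
    edges = []
    for a, b in combinations(sorted(votes), 2):
        pairs = list(zip(votes[a], votes[b]))
        same = sum(1 for x, y in pairs if x == y)
        if pairs and same / len(pairs) >= threshold:
            edges.append((a, b))
    return edges


# Toy records: A and B agree on 3 of 4 votes; C disagrees with A throughout.
records = {"A": "YYYY", "B": "YYYN", "C": "NNNN"}
for t in (0.65, 0.80):
    print(t, agreement_edges(records, t))
```

Raising the threshold can only delete edges, which is why cliques shrink and split as the agreement level goes up.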
So we can see some interesting patterns here just by looking at the maximal clique. And in this
case I was just using a greedy algorithm, taking all of the cliques in the network, using the
Tomita algorithm, I believe, and then just picking the largest one as we go.
There is a more expensive approach you could use to find which ones would be most effective to
show, but in this case it actually worked rather well.
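The greedy selection described here can be sketched as: enumerate the maximal cliques, take the largest, remove its nodes, and repeat. The enumeration below is Bron-Kerbosch with the pivoting rule usually attributed to Tomita and colleagues. The function names, the `min_size` cutoff, and the tie-breaking rule are assumptions for the example, and the graph is assumed to be simple (no self-loops).

```python
from collections import defaultdict


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch plus pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(p & adj[v]))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def greedy_cliques(edges, min_size=3):
    """Repeatedly pick the largest remaining clique and remove its nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    adj = dict(adj)

    chosen = []
    while adj:
        # Largest clique first; ties broken alphabetically for determinism.
        best = min(maximal_cliques(adj), key=lambda c: (-len(c), sorted(c)))
        if len(best) < min_size:
            break
        chosen.append(sorted(best))
        adj = {v: nbrs - best for v, nbrs in adj.items() if v not in best}
    return chosen


# Hypothetical graph: a 4-clique a-b-c-d sharing node d with triangle d-e-f.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
print(greedy_cliques(edges))
```

Because the shared node goes to the larger clique, the triangle is reduced to a single edge and drops below the size cutoff, which is the kind of side effect a more expensive selection method would try to weigh.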
Now, one thing you might want to think about when you're figuring out what motifs you want to
show and how to combine them is how you can overlap these glyphs together. Right? How can
you design your motifs like the fans so that they can hang off the side of a clique or your parallel
motifs, your connector motifs so they can hang off the side of a clique as well. There's some
design issues going into how do you show these things in a small amount of space and in
interesting combinations.
And of course it's not very useful unless you have interactions so you can see what's actually
inside them. You have tool tips that can show exactly how many nodes are inside it if you can't
understand the scale, information about where these -- all these leaf nodes are anchored, and then
the context menu that lets you move back to the original visualization that's -- well, it's lossless.
You start with a big simplified view, you get an overview, but you then can get to the details on
demand.
Let's look at another network here. Here we have a big Web crawl, something like 4,000 nodes.
Anytime you do these egocentrically collected datasets for Web crawls, or what we'd usually do
to get social science datasets, you have those big fans along the periphery.
Here we have 800 nodes in any one of these fans. And you might say that this is a reasonable
drawing. I can see most of what's going on here. But there's actually some hidden features
based on the layout that's used.
If we color by the fans in the network, we can see a lot of the ones we saw before. But then
down there in the bottom right we see a lot of overlap between the two fans. And there's actually
a bunch of black nodes in the middle that aren't part of either.
You know, this is a layout heuristic at work. It isn't showing us every individual feature. But if
we simplify these things away and including all those little fans there in the middle that really
shouldn't be there, they're not really central core parts of the network, if we simplify those things
away into the fan glyphs, we're using much smaller ones in this case, we now have dropped the
amount of screen space required by two-thirds.
We still have the glyph so we can look at them. If we look at them closely, we can see the arcs
and see exactly how many leaf nodes there are in these individual places. But we don't really
care about those leaf nodes. We just want to know that they're there and how many of them
there are. They're not the core interesting things to us. The rest of this stuff is, including that
giant connector motif down there in the bottom right that was completely obscured by those fans.
So if we do the connector motifs, we color them, we see all their original locations in the
network, we simplify them away, and now we have a drawing that has a fraction of the original
number of nodes. It's a lot more dense, but you can lay it out again in a much larger screen
space. And if we have a color scale, in this case it's Eigenvector centrality, that color gets
mapped onto the nodes -- sorry, the glyphs there.
Just some information. Those two networks that I was showing, the Lostpedia wiki and the
Voson network, we drop the number of nodes by an order of magnitude pretty similar with the
number of edges. And when you start looking at metrics for graph drawing aesthetics or
readability metrics, you'll actually see that because we're getting rid of a lot of these edge
crossings, we're getting rid of a lot of this node overlap caused by our layout heuristics. In a very
limited amount of screen space, we're going to be ending up with a much more readable drawing.
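Edge crossings are among the simplest of the readability metrics referred to here. A minimal sketch of counting them for a straight-line drawing follows; the function names and the coordinates are illustrative assumptions, and the quadratic pairwise test is chosen for clarity rather than speed.

```python
from itertools import combinations


def _orient(p, q, r):
    """Signed area test: > 0 if r is left of the line p->q, < 0 if right."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def count_crossings(pos, edges):
    """Count pairs of edges whose straight-line segments properly cross.

    pos: dict mapping node -> (x, y). Edges sharing an endpoint are skipped,
    and touching or collinear overlaps are not counted.
    """
    count = 0
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue
        p1, p2, p3, p4 = pos[a], pos[b], pos[c], pos[d]
        d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
        d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
        if d1 * d2 < 0 and d3 * d4 < 0:
            count += 1
    return count


# Hypothetical unit square: the two diagonals cross, the two sides do not.
square = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
print(count_crossings(square, [("a", "b"), ("c", "d")]))
print(count_crossings(square, [("a", "c"), ("b", "d")]))
```

Running the same count before and after simplification, at a fixed screen size, gives one rough way to compare the two drawings.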
So we showed this to some users. They said I'm overwhelmed. It's like one of those vision tests
at the eye doctor when they're looking at the original network. But then when we put the
simplified version, they said, okay, now I can see the central pages, there's few enough nodes in
the network that I can do pairwise comparisons to look at the things. When there's 4,000 things I
can't do that, when there's 500 it starts to become feasible.
And I just finished running statistics this morning on 38 users in this study, and it turns out
that for a lot of interesting tasks, like finding labels and finding maximal cliques and doing
things like that, this approach really works. For other stuff like tracking edges there's some work
to be done still. But there's some interesting results there.
So motif simplification, it's pretty good for reducing complexity and understanding the large
relationships in your network. However, like I said, it might not be so great for the edges. The
frequent motifs you're interested in might not be included in the corpus of things I like to do.
So if you're a biologist, you're interested in feed-forward loops or something like that. What I
focused on is the really high payoff things that take up a whole lot of screen space but don't
really give you much in return.
And of course glyph design has tradeoffs. If you want really tiny glyphs that take up a very
small amount of the screen space, you can't show distributions in them. You can't show much of
the information about the underlying nodes. So how do you design that to match that tradeoff.
And there's all sorts of details and algorithms in our tech report if you're interested.
And with that I'd like to turn it back over to Ben to finish this off.
>> Ben Shneiderman: Maybe I should just pause a minute and maybe go back to originally my
-- I'll just review all that stuff. I want to go to my list. I should have had that -- oh, too far back.
Where's my list? There.
So this is the first time we publicly presented the motif simplification. I just want to pause a
little and ask for your comments. We know that it's a little extra cognitive complexity because
you have to learn what those motifs are, you have to train your eye and your mind to look for
them, but the dramatic simplification seems to be a winning strategy.
And, as Cody said, he's just finishing. He has two more hours before the deadline for the CHI
conference to finish and submit the paper about it, but the results were very promising. There
were 31 different tasks, and not for every one of them was the motif simplification a benefit, but
for many it was and we're trying to understand better when it works and doesn't work. So any
comments or challenges? Yeah.
>>: So both of those motifs you listed are examples of the sorts of things we found by a
technique called modular decomposition of a graph where a module is a set of vertices that all
have the same connections to the rest of the graph? And so I wondered if you had thought about
using modular decomposition more generally as a way to visualization.
>> Cody Dunne: I hadn't talked -- thought about that in specific. I'd like to talk with you about
it after. I had used an approach called Graph Summarization by Saket Navlakha that was
designed mainly for biologic networks, finding functional equivalent things. But the problem
was, like I said, it's these heuristics that combine things without you really knowing what the
topology was.
>> Ben Shneiderman: So we chose only three to start with. Fans, connectors, and cliques.
Which are -- and I guess where's Natalie Reese [phonetic]? Natalie did good work about doing
cliques and near cliques. So that was another inspiration for us as well.
And there are lots of -- the algorithms exist to find all these things. We thought the important --
the contribution here is to turn them into glyphs that made sense, and there was more design
effort than we thought in choosing the edges and the colors and those things.
I see two more hands. Ulrich, your chance.
>>: Two generalizations you might want to think about is, first of all, the degree 1 nodes, you
can extend those into complete trees and then draw them as [inaudible].
>> Ben Shneiderman: Degree 1 nodes -- say again?
>>: If you look at the one shell, so you eliminate all the degree 1 nodes, continue doing this until
you stop.
>> Ben Shneiderman: I see. I see.
>>: [inaudible]
>> Ben Shneiderman: Recur on the idea.
>>: [inaudible] that you represented as --
>> Ben Shneiderman: So the fan of fans essentially. Okay.
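The suggestion from the audience, removing degree-1 nodes repeatedly until none remain, can be sketched as a simple peeling loop. What is left is the part of the graph with no hanging trees. The function name `peel_degree_one` and the sample graph are hypothetical.

```python
from collections import defaultdict


def peel_degree_one(edges):
    """Repeatedly remove nodes of degree <= 1.

    Returns (core, removed): the set of surviving nodes and the list of
    peeled nodes in removal order. The removed nodes form the trees
    ("fans of fans") attached to the core.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    removed = []
    queue = [v for v in adj if len(adj[v]) <= 1]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue  # already peeled via another path
        for u in adj.pop(v):
            adj[u].discard(v)
            if len(adj[u]) <= 1:
                queue.append(u)
        removed.append(v)
    return set(adj), removed


# Hypothetical graph: triangle x-y-z with a path z-a-b and a leaf z-c.
edges = [("x", "y"), ("y", "z"), ("z", "x"), ("z", "a"), ("a", "b"), ("z", "c")]
core, removed = peel_degree_one(edges)
print(sorted(core), sorted(removed))
```

Each node and edge is touched a constant number of times, so the peeling runs in time proportional to the size of the graph.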
>>: And the second thing is that the connectors are not only a special case of modules and
modular decomposition, but the first [inaudible] here are all the same as well. They're
[inaudible] coordinates [inaudible] same neighbors but are not connected to each other or are all
connected to each other.
>> Ben Shneiderman: Right. Well, we chose the very simple case of the two, three, and four
connectors. That's what we were -- that's what we implement. This is working. It's in NodeXL.
It's shippable. You can try it. Well, close to. Is Tony Capone [phonetic] here?
>>: Linear time.
>> Ben Shneiderman: Tony is our programmer.
>> Cody Dunne: What's that on linear time?
>>: The structural equivalence classes can be determined in linear time, so it might be interesting
because it extends to higher degrees as well.
>> Ben Shneiderman: Well, we see many general --
>> Cody Dunne: So for the connector motifs, I'm finding them in time proportional to the
number of edges times the average -- sorry, the number of nodes times the average node degree.
>> Ben Shneiderman: We think they can be made faster to identify, and then the control panel
about how you do the replace. And what you allow for color and shape and size, also the
cliques -- well, okay. There's still lots to be done in many variations.
Cody mentioned that especially rich motif work is done in biology, where they're looking for
particular motifs in the biological pathways.
And I saw one more hand. I give Peter a chance.
>>: [inaudible] I think the clique I think [inaudible] when we did that was that so the [inaudible]
we're looking at is [inaudible] but an ambiguity of it [inaudible] every graph that we looked at
had many different clique counts and so you need an authorization [inaudible] choose what is not
unique. Lots of different ways of [inaudible].
>> Cody Dunne: So I think the best approach would be to solve the set packing problem and
choose the set of cliques that combine the most nodes or satisfies whatever property you're trying
to satisfy.
The one I'm doing right now is just a greedy approach, examples anyway. Where there are a lot
of overlapping cliques seems to perform pretty well.
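To make the contrast between the greedy choice and the set-packing formulation concrete, here is a brute-force sketch that picks the pairwise disjoint cliques covering the most nodes. It is exponential in the number of candidate cliques, so it only illustrates the objective on tiny inputs. The function name `best_disjoint_cliques` and the sample cliques are assumptions.

```python
from itertools import combinations


def best_disjoint_cliques(cliques):
    """Exhaustively choose pairwise disjoint cliques covering the most nodes."""
    cliques = [frozenset(c) for c in cliques]
    best, best_cover = [], 0
    for k in range(1, len(cliques) + 1):
        for combo in combinations(cliques, k):
            cover = sum(len(c) for c in combo)
            # The sizes add up to the union size only if no node is shared.
            if cover > best_cover and len(frozenset().union(*combo)) == cover:
                best, best_cover = list(combo), cover
    return best


# Hypothetical candidates: the largest clique overlaps both smaller ones.
candidates = [{"a", "b", "c", "d"}, {"a", "e", "f"}, {"d", "g", "h"}]
print([sorted(c) for c in best_disjoint_cliques(candidates)])
```

On these candidates a largest-first greedy choice would take the 4-clique and cover four nodes, while the two disjoint triangles together cover six.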
>> Ben Shneiderman: And let's take two or three more. I love this. Yes. That's what we came
for.
>>: I'm sorry, I'm having trouble telling the difference between a clique glyph and a connector
glyph. They look the same to me. Are they different?
>> Ben Shneiderman: Yes. Clique is a fully connected subgraph.
[multiple people speaking at once]
>> Ben Shneiderman: The glyph, oh, they're rotated. They're 90 degrees rotated. We had long
debates about this. I in my last e-mail to Cody said I thought that's a problem too. They're
rotated by 90 degrees. They're lovely --
>>: Then we have edges coming in all different directions.
>> Ben Shneiderman: Correct. But that's another problem we were not able to solve in the cute
way. But when you have three and four connectors like you do over here, you can't make them
come in one corner and out the other. It doesn't work out.
We had other -- we had a bridge-shaped clique and a glyph. We tried many things. You may
find ways to improve this. We look for your suggestions. This is still a fresh idea that's just
getting tuned up. It will be part of Cody's dissertation, so don't steal it until you please help him.
But I think -- there's many things to be done to extend these ideas, and we think there are many
ways people go -- and, Roberto, you get the last word.
>>: I was just wondering why the [inaudible] have one vertical edge and then the clockwise
because --
>> Ben Shneiderman: Yes. That was my obsession.
>>: [inaudible]
>> Ben Shneiderman: That was my obsession. I won't fight that battle. I wanted them always to
start straight up and then they would arc out over. And that way you could tell more easily the
difference by how much angle was subtended and also visually if in a cluttered graph you would
be able to spot them easier, but you're looking essentially -- your preattentive processing is
looking for that one vertical edge.
Notice also in the current design the length of these fans is the same. That was another feature to
keep this simple. We could have encoded some other variable, and yet we tried to make the
glyphs look different from nodes, so we didn't use -- we have many different glyphs that we use
circles, triangles and so on, but we tried to make these look different. So the difference should
be there. There always should be a pointy part and a curvy part.
>> Cody Dunne: There's another cool thing you can do with the fan orientation. So, just like this,
they can overlap on each other and make a little pie chart and show us the proportions, but if you care
about directed edges, how many of those edges to those fans are going out or coming in or
bidirectional, then you can actually segment it in terms of the difference from the vertical. That
way you see exactly which ones are going which direction without having to draw any additional
little icons.
>> Ben Shneiderman: Right. So the further version for directed graphs splits this in three
sections and they're pointed straight up so you go left and right and then center.
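The three-way split for directed graphs can be sketched by classifying each leaf's edges relative to the head: outgoing, incoming, or both. The function names and the sample edges are hypothetical, and how the fractions map to the left, right, and center sections of the glyph is left open here.

```python
def fan_direction_counts(head, leaves, directed_edges):
    """Count leaves by edge direction relative to the head node.

    directed_edges: iterable of (source, target) pairs.
    """
    edge_set = set(directed_edges)
    counts = {"out": 0, "in": 0, "both": 0}
    for leaf in leaves:
        goes_out = (head, leaf) in edge_set
        comes_in = (leaf, head) in edge_set
        if goes_out and comes_in:
            counts["both"] += 1
        elif goes_out:
            counts["out"] += 1
        elif comes_in:
            counts["in"] += 1
    return counts


def segment_fractions(counts):
    """Share of the fan's arc that each of the three sections would get."""
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Hypothetical fan on head "h" with four leaves.
arcs = [("h", "a"), ("b", "h"), ("h", "c"), ("c", "h"), ("h", "d")]
counts = fan_direction_counts("h", ["a", "b", "c", "d"], arcs)
print(counts, segment_fractions(counts))
```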
So there's a bunch of design problems that remain here, and especially Cody showed you this
little thing, that they actually are combinable. And we think we got it right, but we have no proof
that these will combine in a way that does not cause conflict, and we also still have to prove the
theorem that says the order in which you create these glyphs will not affect the outcome.
But we think we can argue that case. Okay? I mean, there's some suggestion if you did one of
these first then the other wouldn't work out, but, no, we think we defined these motifs in a way
that they're independent of the sequence in which you apply them. Got it? Okay.
So we can talk more about it, but our time is running out, and now I'm going to have to go
forward.
I just have a couple of closing slides, so we'll flash through here as fast as this will let me go, and
you can review the whole presentation. And there we go.
So just to say more about NodeXL, we do have the NodeXL Graph Gallery. This is a public
open source place like many nets where you can upload your graph datasets and your
visualizations. There's thousands of them out there by at least hundreds of people, many
different ways. Some of them are good, not all of them are beautiful.
And we also -- it's in NodeXL which runs inside Excel. When you say export, it will export to
graph gallery and give you the option of exporting the dataset as well.
So this page, if you're looking for network datasets, there are thousands of network datasets out
there, and you can go and grab them if people have uploaded them.
And usually there are descriptions of them in great detail. Marc Smith, who's our strong
collaborator, has many politically oriented ones, and so you can take a look at whatever he's up
to today.
That's the book. The first three chapters describe network analysis and social media, then there's
four chapters that walk you through the use of the tool, and then we have eight application
chapters that show you analyses for e-mail, threaded networks, Twitter, Facebook, World Wide
Web, Flickr, YouTube, and wiki network.
So it's sort of a starting place. We wrote this as a textbook and to guide newcomers from many
different disciplines to be able to create their own networks.
NodeXL was supported on -- happy to say this on Microsoft territory -- by Microsoft external
research for more than four years, and then they said it's time for you to go off and find the rest
of your support. And so we are now owned by what we created, the Social Media Research
Foundation for free, open source data, open data, open tools, open scholarship.
And we struggle. So if you can find someone to help sponsor us and support it, we'd appreciate
that. But the Social Media Research Foundation is the home base for NodeXL. We see it as like
the R statistics package. We'd like to keep it going as a community-based open source and free
tool for people to use.
So if you can help us out, please do. Come visit SMR Foundation or the NodeXL site itself.
And I'd just close by thanking you from HCIL and our 25, 30 now years of happy use, of happy
community, and the NodeXL Web site is down here, nodexl.codeplex.com.
And I thank you for giving me the opportunity and look forward to discussions. Thank you very
much.
[applause]
>> Ben Shneiderman: Thank you. Okay.