NodeXL_tutorial_draft - ccb

advertisement
Network Analysis with NodeXL: Learning by Doing
Derek Hansen, Ben Shneiderman, Marc Smith
Draft Version (2/25/09)
Please do not distribute
Draft (2/25/2009)
: Network Analysis with NodeXL
- 1 -
Network Analysis with NodeXL: Learning by Doing
1) Basic: Getting started with NodeXL
Get started by opening NodeXL at the Basic Layer which shows the usual Excel menu bar is across the
top, a blank workbook on the left, and a graph pane on the right (Fig 1). The idea of NodeXL is that
users fill in the columns in the Edges worksheet with an edge list consisting of vertex pairs that are
related to each other.
Figure 1
Data entry: One way to begin using NodeXL is to type in your own edge list. For example, you might
type the name of an individual who invited another person to a party in Vertex 1 and the name of the
person who was invited in Vertex 2 (Fig 2).
Figure 2
Draft (2/25/2009)
: Network Analysis with NodeXL
- 2 -
Reading the workbook to display the graph: Then when you click on the Read Workbook button
directly above the graph pane, a network showing the party invitations is displayed (Fig 3). Note that
our example assumes directed relationships, that is, Ann invites Bob, but Bob has not invited Ann to a
party. The arrows point from the inviters (people in Vertex 1) to the invitees (people in Vertex 2).
Figure 3
Highlighting an edge: Clicking on one of the workbook rows will highlight an edge and the two
vertices in the graph. For example, clicking on row 5 highlights the edge connecting Ann to Carol (see
Fig 4). You can even click on multiple rows an all related edges and vertices will be highlighted.
Figure 4
Draft (2/25/2009)
: Network Analysis with NodeXL
- 3 -
Importing an edge list: Another way to begin using NodeXL is to use the Import command to load an
edge list supplied by someone else. The Import Command is found on the NodeXL Ribbon (see Fig 5)
along with other NodeXL specific commands. Someone may provide you with an edge list in the form
of a Pajek file (another social network analysis program) or in a standard Excel workbook.
Alternatively, you can cut and paste from another Excel spreadsheet to fill in the edge list.
Figure 5
Additional import options (e.g., importing an email or Twitter network) are also available (see Fig 5).
Hiding and Resizing the Graph Pane: As you work with the data you may want to temporarily hide
the graph pane. You can do this by clicking on the Show/Hide Graph Pane button in the NodeXL
ribbon. You can resize the pane by moving the cursor to the left side of the pane until you see the ↔
symbol and then dragging it to the desired size. It is also possible to undock the pane or move it above
the worksheet data by clicking on the title that reads “Document Actions” and dragging it around.
2) Layout: Arranging Vertices in the Graph Pane
Manual Layout: Let’s return to our example of the 8 party invitations. You may have had trouble
making sense of Fig. 3 because the vertices are in disarray (they were placed on the graph according to
the default layout type Fruchterman-Reingold). You can click and drag the vertices one at a time to
create arrangements that emphasize structures or create a more orderly display (Fig 6). You can even
select multiple vertices by drawing a box around them or clicking on additional vertices while holding
down the Control key. If multiple vertices are selected they will all move together when dragged.
Figure 6
Draft (2/25/2009)
: Network Analysis with NodeXL
- 4 -
Automatic Layout: NodeXL also offers several automatic Layout Types that can be selected from the
control in the graph pane (which has the label Document Actions), such as the Circle Layout (Fig 7).
Figure 7
Experimenting with different Layout Type (e.g. spiral, grid, Sugiyama) can reveal useful patterns,
relationships, or unusual features.
Undirected Graph Type: In NodeXL, the default Graph Type is Directed, which means there is a
relationship between Vertex 1 and Vertex 2 that may not exist between Vertex 2 and Vertex 1. In the
example with party invitations, the meaning is that Ann has invited Bob, but Bob has not invited Ann
to a party. This is shown in the graph pane as an arrow from Ann to Bob. NodeXL also allows users
to specify the Graph Type as Undirected, which means that relationships are equal between people, for
example the relationship might be that they have spoken to each other. You can select the Graph Type
on the left-hand side of the NodeXL ribbon (see Fig 5). Then after clicking on Read Workbook, the
new network display is shown using links between vertices without arrows (see graph pane in Fig 8).
Figure 8
Draft (2/25/2009)
: Network Analysis with NodeXL
- 5 -
Updating the Graph Pane: Any time you change the underlying data or features that affect layout
(e.g., directed versus undirected), you must click on the Read Workbook button to update the graph. If
you just want to change the layout you can select a new layout type and click on Lay Out Again to
reduce processing time.
Zooming and Scale: To get a closer look at a subsection of a graph you can use the Zoom slider (or a
mouse scrollbar after first clicking somewhere in the graph pane). Once you are zoomed in you can
pan across the graph by holding down the Spacebar, clicking the mouse button, and dragging the curser
in the direction you want to pan. You can also use the Scale slider to change the size of the canvas,
which makes the vertices and edges get larger or smaller.
3) Visual Design: Making network displays meaningful
Drawing a meaningful graph can reveal patterns, relationships, and interesting features that may be
hard to spot in a tabular edge list. NodeXL is designed to enable you a rich variety of possible
drawings for a graph.
Node Colors: You may want to change the colors of nodes. For example, in the friendship graph, you
might want to color nodes that represent men with blue and the women with pink. Look at the tabs on
the lower left and click on the Vertices tab, which will bring up the list of 8 vertices (also called nodes)
in our party invitation data set. You can select colors for the 3 women and 5 men, then click on Read
Workbook to redisplay the Graph Pane (Fig 9).
Figure 9
Draft (2/25/2009)
: Network Analysis with NodeXL
- 6 -
Fixing Node Placement: If you want to the placement of the nodes to remain fixed in their position,
find the “Locked?” column on the Vertices tab and choose “Yes (1)” (or just “1”) for each of the
vertices. This prevents the vertices from moving when you click on Read Workbook again. You can
also use the two columns labeled X and Y to fine-tune vertex placement so they are exactly aligned
(see Fig. 11). For example, it may be helpful to have the same Y-value (which controls vertical
placement) for all of the top level nodes in Fig. 9. After changing these values and locking them you
will need to click on Read Workbook again to update the graph.
Subgraph Images: Another NodeXL feature is the drawing of Subgraphs for each node (second
column in Fig 9). This is accomplished by going to the Analysis section on the NodeXL ribbon (Fig.
5) and clicking on the Create Subgraph Images option. The dialog box (Fig 10) lets you control the
properties of the drawing and where to put the results. Note that this feature can be computationally
intensive for large data sets.
Figure 10
Draft (2/25/2009)
: Network Analysis with NodeXL
- 7 -
Adding Descriptive Data: If you have additional information about the people in the data set, you can
add your own columns of data by typing (or pasting it in). You might want to record the age of each
person, so scroll the Vertices worksheet to the right until you see the column header “Add your own
Columns Here”. Place the cursor on this header to get further instructions. If you select the next free
column, you can type an attribute name (e.g., Age) and then enter values for each person. In Fig. 11 we
have added two new columns, one for Age and one for number of Prior Parties the individual has
attended since the beginning of the year.
Figure 11
Changing Vertex Size (and other properties): Another visual property that can be used to encode attribute
values is node size, which is controlled by the Radius column in the Vertices worksheet. Put your curser over
the Radius column header to show the type of data that must be entered – in this case numbers 1-10 (Fig. 12).
Use this same approach to see what type of data to enter into any of the different fields such as Shape, Color,
and Opacity.
Figure 12
Draft (2/25/2009)
: Network Analysis with NodeXL
- 8 -
There are three ways to enter numbers into the Radius column (or other visual attributes such as Opacity or
Color): (1) You can manually type them in, (2) you can enter a formula that calculates a number for the radius
based on some other data (e.g., the Prior Party field you entered earlier), or (3) you can use the AutoFill
feature to let NodeXL fill in the column based on some other data (e.g., the Prior Party column). Figure 13
shows the result of using the NodeXL AutoFill feature to automatically fill in the Radius numbers based on the
Prior Party data we entered earlier.
Figure 13
AutoFilling Columns: To recreate Figure 13, first click on the AutoFill Columns button in the NodeXL ribbon. The
resulting Dialog box (Fig. 14) offers a set of drop-down boxes to allow you to select data you have entered in as
additional fields. Click on the downward pointing arrow next to Vertex Radius to see all of the data columns
you have entered in and choose Prior Parties (instead of Age). Notice that you can do the same for many other
visual attributes of the Vertices as well as the Edges. Those associated with vertices populate columns in the
Vertices worksheet, while those associated with edges populate columns on the Edges worksheet. Note that
the column data won’t show up until after you click on Read Workbook.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 9 -
Figure 124
Each attribute has an associated Options page that allows you to fine-tune some of the attributes. In our
example, we want to assure that the Nodes are large enough to view well, so we can click on the “…” button in
the Options column for the Vertex Radius row (Fig. 15). Here we can change the smallest number in the
column from the default value of 1.0 to 2.0 so the nodes will all be sufficiently large to see well.
Draft (2/25/2009)
Figure 15
: Network Analysis with NodeXL
- 10 -
Changing General Graph Appearance: Another way of setting visual features is to go to the Graph Pane and
click on the Options button (or right click in the Graph Pane and select Options) to bring up the Options Dialog
Box (Fig 16). It offers controls for setting visual features for Vertices, Selected Vertices, Edges, Selected edges,
Fonts, Margins, etc. Note that some of these (e.g., Radius, Opacity, Color) will be superseded by numbers in
the corresponding columns on the Vertices or Edges worksheets if they are populated.
Figure 136
Draft (2/25/2009)
: Network Analysis with NodeXL
- 11 -
4) Labeling: adding text labels to nodes and links
Adding Labels: You can invoke the AutoFill Columns feature again to fill the Primary Labels column with the
names from the Vertex column. Then when you Read the Workbook, the nodes become filled with the labels
(Fig 17). The color coding remains but the size coding is no longer possible. In this case the Pink color made
the text too light to read easily, so the color was Pink was changed to Deep Pink.
Figure 17
Using Secondary Labels: It is possible to use labels and other characteristics such as Radius using the Secondary
Label column. To re-create Figure 17 you can use the AutoFill feature to fill the Secondary Label with the
Vertex column (and remove the Primary Label). This will result in the radius and label showing up at the same
time. To make the graph and labels more readable you can set the maximum radius to be 6.0 instead of the
default 10.0 (see Fig. 15). You may also want to make the Edges semi-transparent so labels that overlap with
them will be more readable. Do this by filling the Opacity column on the Edges worksheet with 3.0s.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 12 -
Figure 18
Adding Tooltips: You can also add data that only shows up when you mouse over a vertex. This is called a
Tooltip. In Figure 18 we have AutoFilled the Tooltip column with the Age column and when you mouse over
Helen you will see her age (22 in this case).
Draft (2/25/2009)
: Network Analysis with NodeXL
- 13 -
5) Graph Metrics: Calculating and visualizing metrics
When analyzing networks, network analysts often want to identify important nodes, locate subgroups, or get a
sense of how interconnected a network is compared to other networks. While visualization itself can help do
this, it is often helpful to use graph metrics that provide quantitative measures that characterize various
aspects of a graph. NodeXL can calculate several graph metrics for you. Once calculated, you can use the graph
metrics to change the visual display of your network graphs in powerful ways.
Selecting Graph Metrics: You must first choose what graph metrics you want to calculate. To do this, click on
the downward pointing arrow next to the Calculate Graph Metrics button on the NodeXL ribbon and then
choose Select Graph Metrics. This will open up the dialogue box in Figure 19 that shows you the available
graph metrics. Select the ones you want to calculate by checking in the boxes next to them. Clicking on the
Details link next to them will explain what the metrics mean. For now, click on the Select All button and choose
OK.
Figure 19
Calculating Graph Metrics: To calculate the graph metrics you will need to then click on the Calculate Graph
Metrics button in the NodeXL ribbon. Some of the graph metrics can take a while to calculate when working
with large networks. When all of the graph metrics are done calculating, NodeXL adds new columns will be
added to the Edges and Vertices tabs. NodeXL also adds content to the Overall Metrics tab.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 14 -
Kite Network Example
To better understand the meaning of the various graph metrics, we will use a network
called the Kite Network, created by David Krackhardt (see
http://www.orgnet.com/sna.html). If you want to save the party file that you have
created, do so now by choosing Save As… and then selecting Excel Workbook followed
by the name of the file (e.g., Party_NodeXL.xlsx) and then closing out of it. If you have
downloaded the Kite_Network.xlsx file from (website) you can then Open the file. If you
have not, you can enter the names found in the edge list shown in Fig 20. Make sure you
choose “Undirected” in the Graph Type as shown in Fig 20. Once you have opened the
file, calculate all of the graph metrics as described above in this section.
Overall Metrics: Go to the Overall Metrics tab (Fig 21). This tab summarizes some of the
key properties of the entire network including the graph type (Undirected vs Directed),
the number of unique edges, edges with duplicates, and total edges. For some networks
the Edges with Duplicates is an important measure since two vertices may be joined by
more than one edge. This is not the case for the current network. The Self-Loops row
shows the number of vertices that link to themselves. A self-loop occurs when the edge
list includes the same exact name in the Vertex 1 and Vertex 2 columns on the Edges tab
(i.e., a person is connected to themselves). The tab also shows the total number of
Vertices, the NodeXL version that it was created using, and the Graph Density. The
Graph Density is a number between 0 and 1 that is an overall measure of the
Figure 20
inter-connectedness of the vertices. For a undirected graph where all vertices are
connected to all others through at least one edge, the Graph Density is calculated by dividing the number of
Total Edges (18 in the example) by the maximum number of possible edges (45 in the example), which equals
0.4 for the example. A more dense graph (e.g., 0.6) would include more Total Edges, while a more sparse
graph (e.g., 0.2) would include fewer.
Figure 21
Draft (2/25/2009)
: Network Analysis with NodeXL
- 15 -
Degree, Centrality, and Clustering Coefficients: Next, go to the Vertices tab. You will notice there are several
new columns that were inserted next to the Vertex column after you Calculated Graph Metrics (see Fig 21).
Each of these metrics relate directly to one of the vertices. For example, row 2 shows the various graph metrics
that are specific to Andre (Fig 22). Below is a brief description of some of these metrics.
Figure 22
Degree (Column F): The Degree of a vertex (sometimes called Degree Centrality) is a count of the number of
edges that are connected to it. Notice that Diane has a Degree of 6 because she is directly connected to 6
other individuals. In comparison, Jane has the lowest Degree with only one connection to another person. If
the edges represented strong friendship ties of individuals in a class, we might say that Diane is the most
popular person in the class and Jane is the least popular. If we were using an undirected graph (such as the
Party Network), the single Degree metric would be split into two metrics: (1) In-Degree, which measures the
number of edges that point toward the node of interest, and (2) Out-Degree, which measures the number of
edges that the node of interest points toward. Note that in Fig 22 we have used the AutoFill Columns feature
to set the Radius of each vertex to a number between 2 and 7.
Betweenness Centrality (Column B): While being popular may be important, it is not everything. Let
us consider Heather for a moment. Notice that she is only directly related to 3 other people (i.e., she
has a degree of 3). Despite her relatively low Degree, her position as a “bridge” between Ike (and
indirectly Jane) to the rest of the group may be of utmost importance. If, for example, information
were passed from one person to another, Heather would be vital for assuring that Ike and Jane could
communicate with the rest of the group. In fact, if she was removed from the network, Ike and Jane
would be disconnected from the other class members. Thus, Heather has high Betweenness
Centrality. In contrast, Ed has a Betweenness Centrality of 0. Notice that if he were removed from the
Draft (2/25/2009)
: Network Analysis with NodeXL
- 16 -
graph everyone would still be connected to everyone else and their shortest communication paths
would not even be altered. More generally, vertices that are included in many of the shortest paths
between other vertices have a higher Betweenness Centrality than those that are not included. In Fig
22 you can see the AutoFill Columns feature has set the Opacity of each vertex to a number between 4 and
10, which is why Heather shows up the darkest and Ed and Carol show up the lightest.
Closeness Centrality (Column C): Another characteristic you may care about is how close each person
is to the other people in the network. If information flowed through edges in the network, some
people would be able to contact all the other people in only a few steps, while others may require
many steps. Closeness Centrality is a measure of the average shortest distance from each vertex to
each other vertex. In the Kite Network, Fernando and Garth have the lowest Closeness Centrality
measure, suggesting that they may be in a good position to spread information through the network
efficiently. Fig 22 shows the AutoFill Columns feature has set the Tooltip to the Closeness Centrality
metric (notice the number that shows up when mousing over Fernando).
Eigenvector Centrality (Column D): In many cases, a connection to a popular individual is more
important than a connection to a loner. The Eigenvector Centrality metric attempts to take into
consideration not only how many connections a node has (i.e., its Degree), but also the Degree of the
nodes that it is connecting to. Both Heather and Ed have a Degree of 3. However, Ed is directly
connected to Diane, the most popular person in the class, whereas Heather is connected to Ike who is
among the least popular. This helps explain why the Eigenvector Centrality metric for Heather is
lower than it is for Ed.
Clustering Coefficient (Column E): In some cases, a person’s friends may be friends with each other,
creating a clique. For example, Ed’s three friends Beverly, Diane, and Garth are all directly connected
to one another (i.e., they create a complete graph). In other cases, a person’s friends may not be
friends with one another. For example, Ike’s two friends Heather and Jane are not friends with each
other. The Clustering Coefficient measures how connected a node’s neighbors are to one another.
More specifically, it is the number of edges connecting a node’s neighbors divided by the total
number of possible edges between the node’s neighbors. For example, Heather’s three neighbors are
Fernando, Garth, and Ike. Only one connection exists between any of them (the connection between
Fernando and Garth). There are three possible connections (Fernando-Garth; Fernando-Ike; GarthIke). Thus, the Clustering Coefficient for Heather is 1/3.
6) Filtering: Reducing clutter to reveal important features
So far the examples show small, simple networks with only a handful of vertices. Larger, more complex
networks often create cluttered graphs that are hard to interpret. NodeXL includes strategies for making
sense of these larger networks and discovering important features of the data.
Twitter Network Analysis
This section analyzes shakmatt’s Twitter network (available for download from website). If you use Twitter you
may want to import your own dataset. To do so, click the Import button on the NodeXL ribbon and choose
Draft (2/25/2009)
: Network Analysis with NodeXL
- 17 -
From Twitter Network Analysis… To make sure you get enough data, check the box that says “Also show the
friends’ friends” and uncheck the box that says “Include everyone’s latest postings as primary labels.” Click
Start and be patient while it downloads your data (follow the links on Twitter rate limits if needed).
Once you have imported the data (it should be a Directed Graph), make sure to Calculate Graph Metrics (we
have calculated In-Degree, Out-Degree, and Betweenness Centrality) and then Read Workbook. The result may
look something like Fig 23. Using the Zoom and Scale tools above the graph can be helpful when navigating
large graphs such as this one.
Figure 23
Sorting Data: One way to begin to make sense of the data is to sort the Graph Metric fields. In Fig 23 the list of
Vertices has been sorted by Out-Degree so that individuals who have subscribed to many other people are
shown on the top of the list. Sorting on the other measures can give you a quick idea of some of the unique
and important people within the network.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 18 -
Dynamic Filters: Filtering out certain edges or nodes so they don’t show up on the graph is a good way to
reduce clutter. One way to do this is to click on the Dynamic Filters button just above the graph (you may have
to click on the downward pointing arrow on the upper-right hand side of the graph to access the Dynamic
Filters button). You should see a new window like the one shown in Figure 24. The box offers a number of
double box range sliders to help you filter. The number on the left-hand side is the minimum value found in
the workbook, while the number on the right-hand side is the maximum value. The top set of sliders filter out
Edges, while leaving in the Vertices. The second set of sliders filter out the vertices and all edges that point to
those vertices. When items are filtered, they are still read into the graph and will show up if you click on the
corresponding vertex or edge in the data portion of the spreadsheet.
Figure 24
The following Figures demonstrate some ways you can use dynamic filters to better understand your data and
create less cluttered graphs.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 19 -
You’ll notice that in Fig 23, there are so many edges
connected to the vertex on the left-hand side that it is
hard to determine the direction of the edges. Click on
the down arrow next to the Vertex 1 Out-Degree so
that it changes from 979 to 978. This removes the edges
that are associated with any vertices that are in Vertex
1 and have an Out-Degree of 978 or less. In practice,
this deletes the edges coming out of the vertex with the
highest Out-Degree (vielmetti). The image should now
look like Figure 25. All of the vertices are still shown,
but the edges that point to them are hidden. You can
now tell that only two edges are pointing toward the
vertex and track down who they are from (shakmatt
and libbyh).
Figure 25
You may want to focus in on the relationships between
shakmatt’s friends. One way of doing this is to change
the Vertex 2 Out-Degree from 0 to 1. This hides all
edges originating from people that were found in
Vertex 2 who are not following another person in the
network. The result is shown in Fig 26, a graph that
shows edges between all of shakmatt’s friends except
one who has not subscribed to anyone at all. This graph
allows you to more easily identify shared friendships
and cliques.
Figure 26
Draft (2/25/2009)
: Network Analysis with NodeXL
- 20 -
You may be wondering if there is any value to showing
all of the unconnected vertices on the graph in Fig 26.
While it does give a sense of the broader network, you
may not care. You can hide the disconnected vertices by
changing the minimum Out-Degree value (found in the
Vertex Filters section) from a 0 to a 1. The result is
shown in Fig 27.
Try clicking on the node at the far left (vielmetti). This
displays all of the edges coming out of the node in red.
This demonstrates that the information is still included
in the graph, it is just hidden by the filters. This is also
apparent if you change the Layout Type (e.g., to Circle)
and click on Lay Out Again. You’ll notice that the layout
is based on all of the vertices, not just those currently
showing.
Figure 27
The Dynamic Filters are quiet versatile allowing you to hide edges and vertices based on network metrics
(notice the Betweenness Centrality measures in Fig 24), the X and Y coordinates, or any fields that you have
entered in yourself (e.g., Age in the party example). If you want to reset the values to their minimum and
maximum values you can easily do so by clicking on the Reset All Filters button in the Dynamic Filter’s window
(see Fig 24). Likewise, if you update the workbook data you can refresh the Dynamic Filter data with the
Refresh button (Fig 24).
Filtering Using the Worksheet Visibility Column: It is also possible to filter out edges or vertices using the
Visibility column found on the Edges and Vertices tabs. Go to the Vertices tab and find the column titled
Visibility. Clicking on the downward pointing arrow to the right of cell O2 shows the available options (see Fig
28). Show if in an Edge (1) (the default) will only show vertices included in at least one edge. Show (4) indicates
that the edge will show up on the graph even if it is not included in an edge. Skip (0) indicates that the vertex
will not be read into the graph. It is as if the vertex does not even exist in the worksheet. Hide (2) indicates that
the vertex should be read into the graph, but not displayed unless selected (in which case it will show up in
red). The Dynamic Filters use the Hide function as you saw before. The options for filtering edges (found on the
Edges tab) are similar. Hide (2) and Skip (0) work on edges just like they do for vertices, while Show (1) will
show the edge in all cases.
Using Formulas to Filter and Skip: One of the most powerful features of NodeXL is that you can use formulas to
calculate the values that are used in the various display fields. To demonstrate how this can work go to the
Vertices tab and click on cell O2. We are going to create a conditional statement that will Skip all vertices that
have an Out-Degree of 0. This focuses attention on those people that shakmatt is subscribed to who have also
subscribed to at least one other person. To enter the formula, type in “IF(” and then use the left arrow key to
move over to cell C2 (Out-Degree). Next, type in “>0,1,0)” and hit enter. The entire Visibility column will be
Draft (2/25/2009)
: Network Analysis with NodeXL
- 21 -
populated with the same formula (see Fig 28 just above the column letters for the correct formula). The
formula says to look up the Out-Degree number from the appropriate row and compare it to 0. If it is greater
than 0 (i.e., if TRUE), then enter the value 1 to indicate the Show if in an Edge option. If it is not greater than 0
(i.e., if FALSE), then enter in the value 0 to indicate that it should be Skipped. Select Layout Type as Circle and
click on Read Workbook to get an image like the one in Fig 28.
Figure 28
Fig 27 and Fig 28 show the same vertices and edges, but Fig 25 has Hidden the other vertices and edges, while
Fig 28 has Skipped them. This means that if you click on a vertex in Fig 28 you will not see all of the other
hidden vertices it is connected to like you did in Fig 27. Their information is not read into the graph.
Furthermore, Fig 28 looks nice when using a Circle Layout because it bases the layout on the selected vertices.
In Fig 27 if you tried the Circle Layout it would position all of the vertices in a circle including those that are
hidden.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 22 -
7) Clustering: Identifying and displaying vertex clusters
It is often helpful to identify vertices that are clustered together into subgroups of interest. Sometimes you will
know which people should be classified into different clusters (e.g., Republicans versus Democrats), while
other times you may want to identify clusters that you don’t know to look for ahead of time (e.g., friendship
cliques within a large social network). NodeXL allows you to create your own clusters manually. It can also help
automatically identify clusters of interest for you. Once identified, the color and shape of the vertices can be
customized to visually display the clusters. To demonstrate how clusters work, you will get to analyze the
voting patterns of U.S. Senators in the year 2007. You will also get a chance to put together some of the
concepts you’ve learned earlier. Special thanks to Chris Wilson of Slate Magazine for providing the dataset.
2007 Senate Voting Analysis
To get started, download and open the Senate_Raw.xlsx file from website. The Vertices worksheet includes
data about each Senator including their party affiliation, the State they represent, and the total number of
votes they cast in 2007. The Edges worksheet includes an undirected edge list connecting each senator to each
other senator. Columns I-L in Fig 29 indicate the total number of votes that were the same (i.e., both voted Yea
or both voted Nay) (column I), the total number of votes cast by the person in Vertex1 (column J) and Vertex2
(column K), and the percent agreement (column L) [The lowest of the two Total Votes (columns J and K) is used
as the denominator to help deal well with data from frequent absentees (e.g., those campaigning)].
Figure 29
Draft (2/25/2009)
: Network Analysis with NodeXL
- 23 -
Try reading the workbook. You should see a large black mass of connections (see Fig 29). This is because every
senator is connected to every other senator at least once. To make sense of the data you will need to filter
some of the edges and change some of the visual components.
Start by changing the color of all of the edges to be Light Grey by selecting it from the drop-down menu in cell
C2 and copying it down (move the cursor to the lower-right corner of the cell until you see a black plus sign
and then double click to automatically fill in all of the cells below). Next, open the AutoFill Columns window
and select the fields that match those in Fig 30. Set the Option for Edge Visibility to “Greater Than 0.65” (the
average agreement percentage between all pairs). The result is that pairs who voted the same less than 65% of
the time will not be connected in the graph (i.e., they will be Skipped, not Hidden as discussed in the Filtering
Section). If they were Hidden, then calculations of graph metrics and clusters would not work later. Read the
workbook again to see a graph that looks like Fig 31.
Figure 30
Draft (2/25/2009)
: Network Analysis with NodeXL
- 24 -
Figure 31
Creating Clusters Manually: To manually create a cluster, go to the Cluster
Vertices worksheet (Fig 32). Copy and paste the Vertex column from the
Vertices worksheet into column B (Vertex). Then copy and paste the Party
column from the Vertices worksheet to Column A (Clusters). Each of the
Vertices is now assigned to a cluster based on their Party affiliation.
Go to the Clusters worksheet and type in the information shown in Fig 33. This
determines what the vertices assigned to each cluster will look like. Make sure
the Clusters listed in Column A include all of the unique values in the Cluster
Vertices worksheet (Fig 32). When “Show Clusters” is checked in the NodeXL
ribbon, the color and shape specified on the Clusters worksheet will override
any color or shape information found in the Vertices worksheet.
Figure 33
Draft (2/25/2009)
: Network Analysis with NodeXL
Figure 32
- 25 -
Read the workbook to see a graph that looks something like Fig 34. You’ll notice a clear clustering between the
Republican and Democratic Senators.
Figure 34
Creating Clusters Automatically: NodeXL includes the capability to automatically identify clusters. Currently the
algorithm described in the article "Finding Community Structure in Mega-scale Social Networks" by Ken Wakita
and Toshiyuki Tsurumi is used to create clusters. Click on the Create Clusters button on the NodeXL ribbon.
This will replace the data you manually entered on the Clusters and Cluster Vertices worksheets with the
automatically generated clusters. Each cluster is given a numerical ID that is shown in Column A of both
worksheets (e.g., see Fig 35). Colors are automatically assigned to each cluster (Fig 35).
Figure 35
Draft (2/25/2009)
: Network Analysis with NodeXL
- 26 -
Go to the Cluster Vertices worksheet to see which vertices are assigned to which cluster. To view the results,
you’ll need to make sure the Show Clusters box is checked in the NodeXL ribbon and then Read the workbook.
Figure 36 shows the result. The graph shows that the clustering algorithm was able to identify the two most
distinct groups, although the automatically assigned colors are not what people would expect (i.e., the
Republican cluster is now blue and the Democratic cluster is now yellow). You can fix these colors by choosing
more appropriate ones from the drop-down menu in the Vertex Color column on the Clusters worksheet (see
Fig 35).
There are also differences in which cluster some of the individuals were assigned to. The automatic algorithm
created a single person cluster (Collins) because he did not fit well into either of the other clusters (although
he considers himself a Republican). The algorithm also grouped Snowe and the two independent Senators
(Lieberman and Sanders) in the Democratic cluster even though they are not technically Democrats. The
number of clusters is not predetermined (i.e., it won’t always be 3). Likewise the number of vertices in each
cluster can vary significantly.
Figure 36
Showing and Hiding Clusters: You don’t need to show the cluster information on the graph. To hide this
information from the graph, uncheck the Show Clusters box on the NodeXL ribbon and re-read the workbook.
The clustering information will be retained on the Cluster worksheets, but the visual display will be determined
by what is on the Vertices worksheet. In this case it will revert back to looking like Fig 31.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 27 -
You may want to experiment with the dataset to practice some of the features introduced earlier. For
example, you can calculate metrics to find the individuals with the highest Betweeness Centrality or other
metrics of interest. You could also adjust the Edge Visibility option to a different number (e.g., 50% agreement)
or use the dynamic filters to see how the network changes as the variables are modified.
Draft (2/25/2009)
: Network Analysis with NodeXL
- 28 -
Download