NodeXL_tutorial_draft - ccb

Network Analysis with NodeXL: Learning by Doing Derek Hansen, Ben Shneiderman, Marc Smith Draft Version (2/25/09) Please do not distribute Draft (2/25/2009) : Network Analysis with NodeXL - 1 - Network Analysis with NodeXL: Learning by Doing 1) Basic: Getting started with NodeXL Get started by opening NodeXL at the Basic Layer which shows the usual Excel menu bar is across the top, a blank workbook on the left, and a graph pane on the right (Fig 1). The idea of NodeXL is that users fill in the columns in the Edges worksheet with an edge list consisting of vertex pairs that are related to each other. Figure 1 Data entry: One way to begin using NodeXL is to type in your own edge list. For example, you might type the name of an individual who invited another person to a party in Vertex 1 and the name of the person who was invited in Vertex 2 (Fig 2). Figure 2 Draft (2/25/2009) : Network Analysis with NodeXL - 2 - Reading the workbook to display the graph: Then when you click on the Read Workbook button directly above the graph pane, a network showing the party invitations is displayed (Fig 3). Note that our example assumes directed relationships, that is, Ann invites Bob, but Bob has not invited Ann to a party. The arrows point from the inviters (people in Vertex 1) to the invitees (people in Vertex 2). Figure 3 Highlighting an edge: Clicking on one of the workbook rows will highlight an edge and the two vertices in the graph. For example, clicking on row 5 highlights the edge connecting Ann to Carol (see Fig 4). You can even click on multiple rows an all related edges and vertices will be highlighted. Figure 4 Draft (2/25/2009) : Network Analysis with NodeXL - 3 - Importing an edge list: Another way to begin using NodeXL is to use the Import command to load an edge list supplied by someone else. The Import Command is found on the NodeXL Ribbon (see Fig 5) along with other NodeXL specific commands. Someone may provide you with an edge list in the form of a Pajek file (another social network analysis program) or in a standard Excel workbook. Alternatively, you can cut and paste from another Excel spreadsheet to fill in the edge list. Figure 5 Additional import options (e.g., importing an email or Twitter network) are also available (see Fig 5). Hiding and Resizing the Graph Pane: As you work with the data you may want to temporarily hide the graph pane. You can do this by clicking on the Show/Hide Graph Pane button in the NodeXL ribbon. You can resize the pane by moving the cursor to the left side of the pane until you see the ↔ symbol and then dragging it to the desired size. It is also possible to undock the pane or move it above the worksheet data by clicking on the title that reads “Document Actions” and dragging it around. 2) Layout: Arranging Vertices in the Graph Pane Manual Layout: Let’s return to our example of the 8 party invitations. You may have had trouble making sense of Fig. 3 because the vertices are in disarray (they were placed on the graph according to the default layout type Fruchterman-Reingold). You can click and drag the vertices one at a time to create arrangements that emphasize structures or create a more orderly display (Fig 6). You can even select multiple vertices by drawing a box around them or clicking on additional vertices while holding down the Control key. If multiple vertices are selected they will all move together when dragged. Figure 6 Draft (2/25/2009) : Network Analysis with NodeXL - 4 - Automatic Layout: NodeXL also offers several automatic Layout Types that can be selected from the control in the graph pane (which has the label Document Actions), such as the Circle Layout (Fig 7). Figure 7 Experimenting with different Layout Type (e.g. spiral, grid, Sugiyama) can reveal useful patterns, relationships, or unusual features. Undirected Graph Type: In NodeXL, the default Graph Type is Directed, which means there is a relationship between Vertex 1 and Vertex 2 that may not exist between Vertex 2 and Vertex 1. In the example with party invitations, the meaning is that Ann has invited Bob, but Bob has not invited Ann to a party. This is shown in the graph pane as an arrow from Ann to Bob. NodeXL also allows users to specify the Graph Type as Undirected, which means that relationships are equal between people, for example the relationship might be that they have spoken to each other. You can select the Graph Type on the left-hand side of the NodeXL ribbon (see Fig 5). Then after clicking on Read Workbook, the new network display is shown using links between vertices without arrows (see graph pane in Fig 8). Figure 8 Draft (2/25/2009) : Network Analysis with NodeXL - 5 - Updating the Graph Pane: Any time you change the underlying data or features that affect layout (e.g., directed versus undirected), you must click on the Read Workbook button to update the graph. If you just want to change the layout you can select a new layout type and click on Lay Out Again to reduce processing time. Zooming and Scale: To get a closer look at a subsection of a graph you can use the Zoom slider (or a mouse scrollbar after first clicking somewhere in the graph pane). Once you are zoomed in you can pan across the graph by holding down the Spacebar, clicking the mouse button, and dragging the curser in the direction you want to pan. You can also use the Scale slider to change the size of the canvas, which makes the vertices and edges get larger or smaller. 3) Visual Design: Making network displays meaningful Drawing a meaningful graph can reveal patterns, relationships, and interesting features that may be hard to spot in a tabular edge list. NodeXL is designed to enable you a rich variety of possible drawings for a graph. Node Colors: You may want to change the colors of nodes. For example, in the friendship graph, you might want to color nodes that represent men with blue and the women with pink. Look at the tabs on the lower left and click on the Vertices tab, which will bring up the list of 8 vertices (also called nodes) in our party invitation data set. You can select colors for the 3 women and 5 men, then click on Read Workbook to redisplay the Graph Pane (Fig 9). Figure 9 Draft (2/25/2009) : Network Analysis with NodeXL - 6 - Fixing Node Placement: If you want to the placement of the nodes to remain fixed in their position, find the “Locked?” column on the Vertices tab and choose “Yes (1)” (or just “1”) for each of the vertices. This prevents the vertices from moving when you click on Read Workbook again. You can also use the two columns labeled X and Y to fine-tune vertex placement so they are exactly aligned (see Fig. 11). For example, it may be helpful to have the same Y-value (which controls vertical placement) for all of the top level nodes in Fig. 9. After changing these values and locking them you will need to click on Read Workbook again to update the graph. Subgraph Images: Another NodeXL feature is the drawing of Subgraphs for each node (second column in Fig 9). This is accomplished by going to the Analysis section on the NodeXL ribbon (Fig. 5) and clicking on the Create Subgraph Images option. The dialog box (Fig 10) lets you control the properties of the drawing and where to put the results. Note that this feature can be computationally intensive for large data sets. Figure 10 Draft (2/25/2009) : Network Analysis with NodeXL - 7 - Adding Descriptive Data: If you have additional information about the people in the data set, you can add your own columns of data by typing (or pasting it in). You might want to record the age of each person, so scroll the Vertices worksheet to the right until you see the column header “Add your own Columns Here”. Place the cursor on this header to get further instructions. If you select the next free column, you can type an attribute name (e.g., Age) and then enter values for each person. In Fig. 11 we have added two new columns, one for Age and one for number of Prior Parties the individual has attended since the beginning of the year. Figure 11 Changing Vertex Size (and other properties): Another visual property that can be used to encode attribute values is node size, which is controlled by the Radius column in the Vertices worksheet. Put your curser over the Radius column header to show the type of data that must be entered – in this case numbers 1-10 (Fig. 12). Use this same approach to see what type of data to enter into any of the different fields such as Shape, Color, and Opacity. Figure 12 Draft (2/25/2009) : Network Analysis with NodeXL - 8 - There are three ways to enter numbers into the Radius column (or other visual attributes such as Opacity or Color): (1) You can manually type them in, (2) you can enter a formula that calculates a number for the radius based on some other data (e.g., the Prior Party field you entered earlier), or (3) you can use the AutoFill feature to let NodeXL fill in the column based on some other data (e.g., the Prior Party column). Figure 13 shows the result of using the NodeXL AutoFill feature to automatically fill in the Radius numbers based on the Prior Party data we entered earlier. Figure 13 AutoFilling Columns: To recreate Figure 13, first click on the AutoFill Columns button in the NodeXL ribbon. The resulting Dialog box (Fig. 14) offers a set of drop-down boxes to allow you to select data you have entered in as additional fields. Click on the downward pointing arrow next to Vertex Radius to see all of the data columns you have entered in and choose Prior Parties (instead of Age). Notice that you can do the same for many other visual attributes of the Vertices as well as the Edges. Those associated with vertices populate columns in the Vertices worksheet, while those associated with edges populate columns on the Edges worksheet. Note that the column data won’t show up until after you click on Read Workbook. Draft (2/25/2009) : Network Analysis with NodeXL - 9 - Figure 124 Each attribute has an associated Options page that allows you to fine-tune some of the attributes. In our example, we want to assure that the Nodes are large enough to view well, so we can click on the “…” button in the Options column for the Vertex Radius row (Fig. 15). Here we can change the smallest number in the column from the default value of 1.0 to 2.0 so the nodes will all be sufficiently large to see well. Draft (2/25/2009) Figure 15 : Network Analysis with NodeXL - 10 - Changing General Graph Appearance: Another way of setting visual features is to go to the Graph Pane and click on the Options button (or right click in the Graph Pane and select Options) to bring up the Options Dialog Box (Fig 16). It offers controls for setting visual features for Vertices, Selected Vertices, Edges, Selected edges, Fonts, Margins, etc. Note that some of these (e.g., Radius, Opacity, Color) will be superseded by numbers in the corresponding columns on the Vertices or Edges worksheets if they are populated. Figure 136 Draft (2/25/2009) : Network Analysis with NodeXL - 11 - 4) Labeling: adding text labels to nodes and links Adding Labels: You can invoke the AutoFill Columns feature again to fill the Primary Labels column with the names from the Vertex column. Then when you Read the Workbook, the nodes become filled with the labels (Fig 17). The color coding remains but the size coding is no longer possible. In this case the Pink color made the text too light to read easily, so the color was Pink was changed to Deep Pink. Figure 17 Using Secondary Labels: It is possible to use labels and other characteristics such as Radius using the Secondary Label column. To re-create Figure 17 you can use the AutoFill feature to fill the Secondary Label with the Vertex column (and remove the Primary Label). This will result in the radius and label showing up at the same time. To make the graph and labels more readable you can set the maximum radius to be 6.0 instead of the default 10.0 (see Fig. 15). You may also want to make the Edges semi-transparent so labels that overlap with them will be more readable. Do this by filling the Opacity column on the Edges worksheet with 3.0s. Draft (2/25/2009) : Network Analysis with NodeXL - 12 - Figure 18 Adding Tooltips: You can also add data that only shows up when you mouse over a vertex. This is called a Tooltip. In Figure 18 we have AutoFilled the Tooltip column with the Age column and when you mouse over Helen you will see her age (22 in this case). Draft (2/25/2009) : Network Analysis with NodeXL - 13 - 5) Graph Metrics: Calculating and visualizing metrics When analyzing networks, network analysts often want to identify important nodes, locate subgroups, or get a sense of how interconnected a network is compared to other networks. While visualization itself can help do this, it is often helpful to use graph metrics that provide quantitative measures that characterize various aspects of a graph. NodeXL can calculate several graph metrics for you. Once calculated, you can use the graph metrics to change the visual display of your network graphs in powerful ways. Selecting Graph Metrics: You must first choose what graph metrics you want to calculate. To do this, click on the downward pointing arrow next to the Calculate Graph Metrics button on the NodeXL ribbon and then choose Select Graph Metrics. This will open up the dialogue box in Figure 19 that shows you the available graph metrics. Select the ones you want to calculate by checking in the boxes next to them. Clicking on the Details link next to them will explain what the metrics mean. For now, click on the Select All button and choose OK. Figure 19 Calculating Graph Metrics: To calculate the graph metrics you will need to then click on the Calculate Graph Metrics button in the NodeXL ribbon. Some of the graph metrics can take a while to calculate when working with large networks. When all of the graph metrics are done calculating, NodeXL adds new columns will be added to the Edges and Vertices tabs. NodeXL also adds content to the Overall Metrics tab. Draft (2/25/2009) : Network Analysis with NodeXL - 14 - Kite Network Example To better understand the meaning of the various graph metrics, we will use a network called the Kite Network, created by David Krackhardt (see http://www.orgnet.com/sna.html). If you want to save the party file that you have created, do so now by choosing Save As… and then selecting Excel Workbook followed by the name of the file (e.g., Party_NodeXL.xlsx) and then closing out of it. If you have downloaded the Kite_Network.xlsx file from (website) you can then Open the file. If you have not, you can enter the names found in the edge list shown in Fig 20. Make sure you choose “Undirected” in the Graph Type as shown in Fig 20. Once you have opened the file, calculate all of the graph metrics as described above in this section. Overall Metrics: Go to the Overall Metrics tab (Fig 21). This tab summarizes some of the key properties of the entire network including the graph type (Undirected vs Directed), the number of unique edges, edges with duplicates, and total edges. For some networks the Edges with Duplicates is an important measure since two vertices may be joined by more than one edge. This is not the case for the current network. The Self-Loops row shows the number of vertices that link to themselves. A self-loop occurs when the edge list includes the same exact name in the Vertex 1 and Vertex 2 columns on the Edges tab (i.e., a person is connected to themselves). The tab also shows the total number of Vertices, the NodeXL version that it was created using, and the Graph Density. The Graph Density is a number between 0 and 1 that is an overall measure of the Figure 20 inter-connectedness of the vertices. For a undirected graph where all vertices are connected to all others through at least one edge, the Graph Density is calculated by dividing the number of Total Edges (18 in the example) by the maximum number of possible edges (45 in the example), which equals 0.4 for the example. A more dense graph (e.g., 0.6) would include more Total Edges, while a more sparse graph (e.g., 0.2) would include fewer. Figure 21 Draft (2/25/2009) : Network Analysis with NodeXL - 15 - Degree, Centrality, and Clustering Coefficients: Next, go to the Vertices tab. You will notice there are several new columns that were inserted next to the Vertex column after you Calculated Graph Metrics (see Fig 21). Each of these metrics relate directly to one of the vertices. For example, row 2 shows the various graph metrics that are specific to Andre (Fig 22). Below is a brief description of some of these metrics. Figure 22 Degree (Column F): The Degree of a vertex (sometimes called Degree Centrality) is a count of the number of edges that are connected to it. Notice that Diane has a Degree of 6 because she is directly connected to 6 other individuals. In comparison, Jane has the lowest Degree with only one connection to another person. If the edges represented strong friendship ties of individuals in a class, we might say that Diane is the most popular person in the class and Jane is the least popular. If we were using an undirected graph (such as the Party Network), the single Degree metric would be split into two metrics: (1) In-Degree, which measures the number of edges that point toward the node of interest, and (2) Out-Degree, which measures the number of edges that the node of interest points toward. Note that in Fig 22 we have used the AutoFill Columns feature to set the Radius of each vertex to a number between 2 and 7. Betweenness Centrality (Column B): While being popular may be important, it is not everything. Let us consider Heather for a moment. Notice that she is only directly related to 3 other people (i.e., she has a degree of 3). Despite her relatively low Degree, her position as a “bridge” between Ike (and indirectly Jane) to the rest of the group may be of utmost importance. If, for example, information were passed from one person to another, Heather would be vital for assuring that Ike and Jane could communicate with the rest of the group. In fact, if she was removed from the network, Ike and Jane would be disconnected from the other class members. Thus, Heather has high Betweenness Centrality. In contrast, Ed has a Betweenness Centrality of 0. Notice that if he were removed from the Draft (2/25/2009) : Network Analysis with NodeXL - 16 - graph everyone would still be connected to everyone else and their shortest communication paths would not even be altered. More generally, vertices that are included in many of the shortest paths between other vertices have a higher Betweenness Centrality than those that are not included. In Fig 22 you can see the AutoFill Columns feature has set the Opacity of each vertex to a number between 4 and 10, which is why Heather shows up the darkest and Ed and Carol show up the lightest. Closeness Centrality (Column C): Another characteristic you may care about is how close each person is to the other people in the network. If information flowed through edges in the network, some people would be able to contact all the other people in only a few steps, while others may require many steps. Closeness Centrality is a measure of the average shortest distance from each vertex to each other vertex. In the Kite Network, Fernando and Garth have the lowest Closeness Centrality measure, suggesting that they may be in a good position to spread information through the network efficiently. Fig 22 shows the AutoFill Columns feature has set the Tooltip to the Closeness Centrality metric (notice the number that shows up when mousing over Fernando). Eigenvector Centrality (Column D): In many cases, a connection to a popular individual is more important than a connection to a loner. The Eigenvector Centrality metric attempts to take into consideration not only how many connections a node has (i.e., its Degree), but also the Degree of the nodes that it is connecting to. Both Heather and Ed have a Degree of 3. However, Ed is directly connected to Diane, the most popular person in the class, whereas Heather is connected to Ike who is among the least popular. This helps explain why the Eigenvector Centrality metric for Heather is lower than it is for Ed. Clustering Coefficient (Column E): In some cases, a person’s friends may be friends with each other, creating a clique. For example, Ed’s three friends Beverly, Diane, and Garth are all directly connected to one another (i.e., they create a complete graph). In other cases, a person’s friends may not be friends with one another. For example, Ike’s two friends Heather and Jane are not friends with each other. The Clustering Coefficient measures how connected a node’s neighbors are to one another. More specifically, it is the number of edges connecting a node’s neighbors divided by the total number of possible edges between the node’s neighbors. For example, Heather’s three neighbors are Fernando, Garth, and Ike. Only one connection exists between any of them (the connection between Fernando and Garth). There are three possible connections (Fernando-Garth; Fernando-Ike; GarthIke). Thus, the Clustering Coefficient for Heather is 1/3. 6) Filtering: Reducing clutter to reveal important features So far the examples show small, simple networks with only a handful of vertices. Larger, more complex networks often create cluttered graphs that are hard to interpret. NodeXL includes strategies for making sense of these larger networks and discovering important features of the data. Twitter Network Analysis This section analyzes shakmatt’s Twitter network (available for download from website). If you use Twitter you may want to import your own dataset. To do so, click the Import button on the NodeXL ribbon and choose Draft (2/25/2009) : Network Analysis with NodeXL - 17 - From Twitter Network Analysis… To make sure you get enough data, check the box that says “Also show the friends’ friends” and uncheck the box that says “Include everyone’s latest postings as primary labels.” Click Start and be patient while it downloads your data (follow the links on Twitter rate limits if needed). Once you have imported the data (it should be a Directed Graph), make sure to Calculate Graph Metrics (we have calculated In-Degree, Out-Degree, and Betweenness Centrality) and then Read Workbook. The result may look something like Fig 23. Using the Zoom and Scale tools above the graph can be helpful when navigating large graphs such as this one. Figure 23 Sorting Data: One way to begin to make sense of the data is to sort the Graph Metric fields. In Fig 23 the list of Vertices has been sorted by Out-Degree so that individuals who have subscribed to many other people are shown on the top of the list. Sorting on the other measures can give you a quick idea of some of the unique and important people within the network. Draft (2/25/2009) : Network Analysis with NodeXL - 18 - Dynamic Filters: Filtering out certain edges or nodes so they don’t show up on the graph is a good way to reduce clutter. One way to do this is to click on the Dynamic Filters button just above the graph (you may have to click on the downward pointing arrow on the upper-right hand side of the graph to access the Dynamic Filters button). You should see a new window like the one shown in Figure 24. The box offers a number of double box range sliders to help you filter. The number on the left-hand side is the minimum value found in the workbook, while the number on the right-hand side is the maximum value. The top set of sliders filter out Edges, while leaving in the Vertices. The second set of sliders filter out the vertices and all edges that point to those vertices. When items are filtered, they are still read into the graph and will show up if you click on the corresponding vertex or edge in the data portion of the spreadsheet. Figure 24 The following Figures demonstrate some ways you can use dynamic filters to better understand your data and create less cluttered graphs. Draft (2/25/2009) : Network Analysis with NodeXL - 19 - You’ll notice that in Fig 23, there are so many edges connected to the vertex on the left-hand side that it is hard to determine the direction of the edges. Click on the down arrow next to the Vertex 1 Out-Degree so that it changes from 979 to 978. This removes the edges that are associated with any vertices that are in Vertex 1 and have an Out-Degree of 978 or less. In practice, this deletes the edges coming out of the vertex with the highest Out-Degree (vielmetti). The image should now look like Figure 25. All of the vertices are still shown, but the edges that point to them are hidden. You can now tell that only two edges are pointing toward the vertex and track down who they are from (shakmatt and libbyh). Figure 25 You may want to focus in on the relationships between shakmatt’s friends. One way of doing this is to change the Vertex 2 Out-Degree from 0 to 1. This hides all edges originating from people that were found in Vertex 2 who are not following another person in the network. The result is shown in Fig 26, a graph that shows edges between all of shakmatt’s friends except one who has not subscribed to anyone at all. This graph allows you to more easily identify shared friendships and cliques. Figure 26 Draft (2/25/2009) : Network Analysis with NodeXL - 20 - You may be wondering if there is any value to showing all of the unconnected vertices on the graph in Fig 26. While it does give a sense of the broader network, you may not care. You can hide the disconnected vertices by changing the minimum Out-Degree value (found in the Vertex Filters section) from a 0 to a 1. The result is shown in Fig 27. Try clicking on the node at the far left (vielmetti). This displays all of the edges coming out of the node in red. This demonstrates that the information is still included in the graph, it is just hidden by the filters. This is also apparent if you change the Layout Type (e.g., to Circle) and click on Lay Out Again. You’ll notice that the layout is based on all of the vertices, not just those currently showing. Figure 27 The Dynamic Filters are quiet versatile allowing you to hide edges and vertices based on network metrics (notice the Betweenness Centrality measures in Fig 24), the X and Y coordinates, or any fields that you have entered in yourself (e.g., Age in the party example). If you want to reset the values to their minimum and maximum values you can easily do so by clicking on the Reset All Filters button in the Dynamic Filter’s window (see Fig 24). Likewise, if you update the workbook data you can refresh the Dynamic Filter data with the Refresh button (Fig 24). Filtering Using the Worksheet Visibility Column: It is also possible to filter out edges or vertices using the Visibility column found on the Edges and Vertices tabs. Go to the Vertices tab and find the column titled Visibility. Clicking on the downward pointing arrow to the right of cell O2 shows the available options (see Fig 28). Show if in an Edge (1) (the default) will only show vertices included in at least one edge. Show (4) indicates that the edge will show up on the graph even if it is not included in an edge. Skip (0) indicates that the vertex will not be read into the graph. It is as if the vertex does not even exist in the worksheet. Hide (2) indicates that the vertex should be read into the graph, but not displayed unless selected (in which case it will show up in red). The Dynamic Filters use the Hide function as you saw before. The options for filtering edges (found on the Edges tab) are similar. Hide (2) and Skip (0) work on edges just like they do for vertices, while Show (1) will show the edge in all cases. Using Formulas to Filter and Skip: One of the most powerful features of NodeXL is that you can use formulas to calculate the values that are used in the various display fields. To demonstrate how this can work go to the Vertices tab and click on cell O2. We are going to create a conditional statement that will Skip all vertices that have an Out-Degree of 0. This focuses attention on those people that shakmatt is subscribed to who have also subscribed to at least one other person. To enter the formula, type in “IF(” and then use the left arrow key to move over to cell C2 (Out-Degree). Next, type in “>0,1,0)” and hit enter. The entire Visibility column will be Draft (2/25/2009) : Network Analysis with NodeXL - 21 - populated with the same formula (see Fig 28 just above the column letters for the correct formula). The formula says to look up the Out-Degree number from the appropriate row and compare it to 0. If it is greater than 0 (i.e., if TRUE), then enter the value 1 to indicate the Show if in an Edge option. If it is not greater than 0 (i.e., if FALSE), then enter in the value 0 to indicate that it should be Skipped. Select Layout Type as Circle and click on Read Workbook to get an image like the one in Fig 28. Figure 28 Fig 27 and Fig 28 show the same vertices and edges, but Fig 25 has Hidden the other vertices and edges, while Fig 28 has Skipped them. This means that if you click on a vertex in Fig 28 you will not see all of the other hidden vertices it is connected to like you did in Fig 27. Their information is not read into the graph. Furthermore, Fig 28 looks nice when using a Circle Layout because it bases the layout on the selected vertices. In Fig 27 if you tried the Circle Layout it would position all of the vertices in a circle including those that are hidden. Draft (2/25/2009) : Network Analysis with NodeXL - 22 - 7) Clustering: Identifying and displaying vertex clusters It is often helpful to identify vertices that are clustered together into subgroups of interest. Sometimes you will know which people should be classified into different clusters (e.g., Republicans versus Democrats), while other times you may want to identify clusters that you don’t know to look for ahead of time (e.g., friendship cliques within a large social network). NodeXL allows you to create your own clusters manually. It can also help automatically identify clusters of interest for you. Once identified, the color and shape of the vertices can be customized to visually display the clusters. To demonstrate how clusters work, you will get to analyze the voting patterns of U.S. Senators in the year 2007. You will also get a chance to put together some of the concepts you’ve learned earlier. Special thanks to Chris Wilson of Slate Magazine for providing the dataset. 2007 Senate Voting Analysis To get started, download and open the Senate_Raw.xlsx file from website. The Vertices worksheet includes data about each Senator including their party affiliation, the State they represent, and the total number of votes they cast in 2007. The Edges worksheet includes an undirected edge list connecting each senator to each other senator. Columns I-L in Fig 29 indicate the total number of votes that were the same (i.e., both voted Yea or both voted Nay) (column I), the total number of votes cast by the person in Vertex1 (column J) and Vertex2 (column K), and the percent agreement (column L) [The lowest of the two Total Votes (columns J and K) is used as the denominator to help deal well with data from frequent absentees (e.g., those campaigning)]. Figure 29 Draft (2/25/2009) : Network Analysis with NodeXL - 23 - Try reading the workbook. You should see a large black mass of connections (see Fig 29). This is because every senator is connected to every other senator at least once. To make sense of the data you will need to filter some of the edges and change some of the visual components. Start by changing the color of all of the edges to be Light Grey by selecting it from the drop-down menu in cell C2 and copying it down (move the cursor to the lower-right corner of the cell until you see a black plus sign and then double click to automatically fill in all of the cells below). Next, open the AutoFill Columns window and select the fields that match those in Fig 30. Set the Option for Edge Visibility to “Greater Than 0.65” (the average agreement percentage between all pairs). The result is that pairs who voted the same less than 65% of the time will not be connected in the graph (i.e., they will be Skipped, not Hidden as discussed in the Filtering Section). If they were Hidden, then calculations of graph metrics and clusters would not work later. Read the workbook again to see a graph that looks like Fig 31. Figure 30 Draft (2/25/2009) : Network Analysis with NodeXL - 24 - Figure 31 Creating Clusters Manually: To manually create a cluster, go to the Cluster Vertices worksheet (Fig 32). Copy and paste the Vertex column from the Vertices worksheet into column B (Vertex). Then copy and paste the Party column from the Vertices worksheet to Column A (Clusters). Each of the Vertices is now assigned to a cluster based on their Party affiliation. Go to the Clusters worksheet and type in the information shown in Fig 33. This determines what the vertices assigned to each cluster will look like. Make sure the Clusters listed in Column A include all of the unique values in the Cluster Vertices worksheet (Fig 32). When “Show Clusters” is checked in the NodeXL ribbon, the color and shape specified on the Clusters worksheet will override any color or shape information found in the Vertices worksheet. Figure 33 Draft (2/25/2009) : Network Analysis with NodeXL Figure 32 - 25 - Read the workbook to see a graph that looks something like Fig 34. You’ll notice a clear clustering between the Republican and Democratic Senators. Figure 34 Creating Clusters Automatically: NodeXL includes the capability to automatically identify clusters. Currently the algorithm described in the article "Finding Community Structure in Mega-scale Social Networks" by Ken Wakita and Toshiyuki Tsurumi is used to create clusters. Click on the Create Clusters button on the NodeXL ribbon. This will replace the data you manually entered on the Clusters and Cluster Vertices worksheets with the automatically generated clusters. Each cluster is given a numerical ID that is shown in Column A of both worksheets (e.g., see Fig 35). Colors are automatically assigned to each cluster (Fig 35). Figure 35 Draft (2/25/2009) : Network Analysis with NodeXL - 26 - Go to the Cluster Vertices worksheet to see which vertices are assigned to which cluster. To view the results, you’ll need to make sure the Show Clusters box is checked in the NodeXL ribbon and then Read the workbook. Figure 36 shows the result. The graph shows that the clustering algorithm was able to identify the two most distinct groups, although the automatically assigned colors are not what people would expect (i.e., the Republican cluster is now blue and the Democratic cluster is now yellow). You can fix these colors by choosing more appropriate ones from the drop-down menu in the Vertex Color column on the Clusters worksheet (see Fig 35). There are also differences in which cluster some of the individuals were assigned to. The automatic algorithm created a single person cluster (Collins) because he did not fit well into either of the other clusters (although he considers himself a Republican). The algorithm also grouped Snowe and the two independent Senators (Lieberman and Sanders) in the Democratic cluster even though they are not technically Democrats. The number of clusters is not predetermined (i.e., it won’t always be 3). Likewise the number of vertices in each cluster can vary significantly. Figure 36 Showing and Hiding Clusters: You don’t need to show the cluster information on the graph. To hide this information from the graph, uncheck the Show Clusters box on the NodeXL ribbon and re-read the workbook. The clustering information will be retained on the Cluster worksheets, but the visual display will be determined by what is on the Vertices worksheet. In this case it will revert back to looking like Fig 31. Draft (2/25/2009) : Network Analysis with NodeXL - 27 - You may want to experiment with the dataset to practice some of the features introduced earlier. For example, you can calculate metrics to find the individuals with the highest Betweeness Centrality or other metrics of interest. You could also adjust the Edge Visibility option to a different number (e.g., 50% agreement) or use the dynamic filters to see how the network changes as the variables are modified. Draft (2/25/2009) : Network Analysis with NodeXL - 28 -

NodeXL_tutorial_draft - ccb

Related documents

Products

Support

NodeXL_tutorial_draft - ccb

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib