Analyzing Microtext: Papers from the 2013 AAAI Spring Symposium

Characterizing Online Discussions in Microblogs Using Network Analysis

Veronika Strnadova (1,2), David Jurgens (2), and Tsai-Ching Lu (2)

(1) University of California, Santa Barbara, Santa Barbara, California, USA; veronika@cs.ucsb.edu
(2) HRL Laboratories, LLC, Malibu, California, USA; {dajurgens,tlu}@hrl.com

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Online discussions of a specific topic in microblogs may vary widely in their content, locality, and participants. We describe a method for analyzing microblog discussions in Twitter, using a network model in order to characterize discussions by their network properties. Building the network from participants, their messages, their locations, and shared message content, we present an analysis of six network features over four types of discussions. Our analysis reveals that the diameter of the discussion network, when normalized for size, provides a strong indicator for the level of concentration around a few items. In addition, we show that different discussion types show clear patterns in how their entities (e.g., participants, locations) vary over time in importance to the discussion. Our analysis identifies clear measures for quantifying and categorizing novel discussions in terms of their important features and expected dynamics.

1 Introduction

The large scale of microblogging activity has given rise to free-form discussions in which participants may join and leave at any time. Furthermore, for a given topic, many factors such as news stories or region-specific interest may drive new users to participate. For example, Romero, Meeder, and Kleinberg (2011) note that topical categories often have very different patterns in how information is shared in their discussions.

We present a pilot study aimed at characterizing the types of online discussions based on network features. Our goal is to identify quantitative measures of driving factors behind a discussion and to discover those features which reveal significant differences in the types of discussion. A network model of the discussion enables the linking of participants in a conversation with their messages and shared conversational features, e.g., hyperlinks.

Recently, Ruiz et al. (2012) proposed a new method for constructing topical networks from Twitter conversations. They demonstrated how the properties of this network could be used to predict stock price changes. We propose that this type of network construction can also be used to categorize the discussion around a topic by observing the temporal change in the properties of the network. For example, we would expect that the conversations around companies would be different than those for sports teams, due to the participants’ focus and even geographic location. Understanding what the quantitative differences are between types of discussions can enable the categorization of a novel topic, based on the dynamics of the online conversation discussing it.

As a part of our initial study, we present an approach to categorizing online topical discussions in Twitter using a network to model the relationships between participants, locations, and discussion points. We propose a new extension to the graph building method of Ruiz et al. (2012) that incorporates a user’s location. Locations enable detecting geographic shifts in a conversation, such as when the discussion around a sports team moves between the cities in which the team plays. We provide an analysis of six network features and demonstrate how each corresponds to different discussion phenomena.

The key contributions are as follows. First, we demonstrate that a normalized variant of the network diameter can track changes in the level of topical interest that aren’t necessarily reflected in frequency-based features which look for the most popular elements. Second, we present a case study of graph features that illustrates how the underlying dynamics of locations, participants, and discussion points vary significantly based on the type of topic in discussion. Notably, we find that the temporal changes in the PageRank of the most frequent node types are the strongest discriminating feature between types of discussions. Third, while Ruiz et al. (2012) found that several network features could be strongly correlated with stock market behavior, we find that their network-based features can often be reduced to frequency-based features that do not require the construction of a discussion network. Last, we show that a previously unaddressed aspect of online discussions, the physical locations of the participants, varies significantly by discussion topic and is essential in categorizing the kind of discussion.

2 Twitter Discussion Network

Twitter provides a multifaceted notion of discussion. A single message may be linked to a larger discussion by both explicit and implicit features. User mentions provide an explicit indication of a person-to-person discussion; a message may link to other users through specific mentions of the user’s name with the “@User” notation. Similarly, a user may repost another message, commonly referred to as retweeting, which acts as both a form of diffusing information and engaging in conversation (Honey and Herring 2009; Boyd, Golder, and Lotan 2010). Conversely, hashtags, e.g., “#baseball,” provide an explicit indication of a message joining a larger, potentially global conversation; hashtags enable other users to discover the message by its use of a shared tag (Huang, Thornton, and Efthimiadis 2010).

We construct a network in which individual tweets link to four other node types: users, hyperlinks, hashtags, and locations. Figure 1 illustrates the schema used to construct the network from individual tweets, and Figure 2 provides an example instantiation of the schema to illustrate connectivity. As the discussion grows, the network grows more connected to elements that are shared in common between many tweets. For example, a viral news story would generate a graph with many tweets pointing to the same hyperlink and possibly to a tweet that was shared multiple times.

[Figure 1 diagram. Node types: Tweet, User, Location, Hashtag, Url. Edge labels: Creates, Created In, Mentions, Annotated, Cites, Re-Tweet.]
Figure 1: The schema used to construct a discussion graph from individual tweets, extended from Ruiz et al. (2012)

[Figure 2 diagram. Example nodes: users Alice, Bob, Clarissa, David; Tweet 1 through Tweet 6; hashtags #baseball, #cardinals, #dodgers; locations St. Louis, Missouri, USA and Los Angeles, California, USA.]
Figure 2: An example demonstrating the connections between various node types in the discussion graph.

Importantly, this network representation of an online discussion enables modeling of shared relations between the features. The discussion may be driven by extrinsic features, such as hyperlinks with common interest, or common geographic location of the participants, which may influence their topic choice.

Following the method of Ruiz et al. (2012), to construct the graph, we select a seed keyword and build a graph from all tweets that mention that keyword. We differ slightly from Ruiz et al. (2012) in our treatment of retweeted messages. Whereas Ruiz et al. (2012) link two tweets together if the Jaccard similarity between the tweets’ message bodies is above a certain threshold, we found that this method overestimates the number of retweets, and instead used the metadata provided by Twitter to establish a retweet. Although the graph is not directly based on the context of the microtext, the URL and hashtag features provide generalized indicators of the tweet’s content.

For associating tweets with locations, we use a two-step process. First, approximately 1% of all tweets come with geocoordinates, which allows us to accurately recover the location at which a tweet originated. We use the Google Maps Reverse Geocoding service to convert these into a canonical city-level location name. Second, when geocoordinates are not available, we follow the method of Cheng, Caverlee, and Lee (2010), which looks into a user’s profile for a location and uses that as the location of the tweeted message. Location names from user profiles were further normalized to the same set of city labels. Furthermore, using the tweets with geocoordinates as ground truth, we calculated the error for profile-based location names and removed those names with a median error above 25km, ensuring a higher precision at the expense of recall. Ultimately, approximately 32% of all messages were linked with a location.

Table 1: The list of terms used to seed the discussion graphs
Musicians: Lady Gaga, Jay-Z, Red Hot Chili Peppers, Dave Matthews Band, Kanye West
Sports Teams: Dodgers, Yankees, Cardinals, Diamondbacks, Padres
Companies: Microsoft, IBM, Intel, Apple, Sony
Countries: Great Britain, United States, Uganda, Greece, Mexico

3 Collective Discourse Analysis Methodology

Our pilot study consists of discussions for four topic categories: sports teams, popular musicians, companies, and countries. We hypothesize that each category represents a different type of online discussion with respect to both content and locality. Specifically, the interest around the topic varies in terms of news content, geographic locality, and trending popularity. Our goal is to identify a set of graph measurements that accurately characterize these different types of discussions in order to develop a methodology for characterizing discussions about arbitrary terms.
For example, we selected American baseball teams as examples of discussions that are often highly local based on where the team is located, but that are mobile as the team travels to play in opponents’ home cities. For each topic category, we selected five terms, listed in Table 1, that might best exhibit different network characteristics.

For each of the terms listed in Table 1 we select all tweets containing that term from a 10% sample of all the tweets which appear on Twitter. For computational efficiency, if in one day a term has over 15,000 tweets, we sample 15,000 tweets without replacement from the selected tweets.

To analyze changes to the graphs over time, we aggregate discussions at the daily level. A daily level enables us to observe multiple discussion points per day that correspond to some variability in the network, while still being able to observe consistent properties and entities that persist between days. For network properties, we chose to observe daily changes in the graph diameter, node type composition, and the degree, PageRank, and closeness distributions of the nodes in each graph. We do not expect these properties to change significantly over the course of a day, but daily changes do help us characterize the discussion.

4 Complete-Graph Metrics

Several graph metrics serve as descriptors of the graph as a whole. We refer to such measures as “complete-graph” metrics and present the effectiveness of tracking temporal changes in these metrics in order to classify online discussions. Specifically, we describe how changes in the number of connected components, the diameter, and the node type distribution can help us categorize discussions.

4.1 Node Types

We hypothesized that differences in the node type composition of the discussion network might reveal what was most important to the discussion. For example, if a discussion includes a large number of hyperlink nodes relative to other days, we may infer that there were external news stories that drove the discussion. Figure 3 illustrates the changes in composition for sample keywords in three of our categories. For each day in July, we plot the percentage of each node type.

In our larger analysis of all the graphs, we did not observe any major variation in node type composition between the categories; the composition was largely similar for all graphs with a roughly equal distribution of tweets and users, and a similarly equal distribution of hyperlinks and hashtags. We note that this result is not necessarily a consequence of the graph construction method. For example, if a discussion is generated by a moderately sized group of users that generate the bulk of the messages, then we would expect to see a greater disparity between the percentage of tweets and users.

However, despite the lack of categorically-biased distributions, several minor trends do emerge. The graph for Microsoft in Figure 3a reveals a temporal pattern in its hyperlink and hashtag nodes, which we found corresponded to an increase in the percentage of hashtags on weekends. A similar temporal trend was seen in other company-based discussions. Both the discussion for the Diamondbacks baseball team and that for Great Britain exhibit significant variation in the node type distributions. However, as the Olympic Games approach and begin on July 27th, we see the distribution become much more consistent for Great Britain, which hosted the games. This change highlights the impact that an external event can have on the dynamics of the discussion.

[Figure 3 panels: (a) Microsoft, (b) Diamondbacks, (c) Great Britain. x-axis: Date in July, 2012; y-axis: Percentage of Type; series: Location, Tweet, User, Hashtag, Url.]
Figure 3: A comparison of the distribution of node types in the largest component for Great Britain, Microsoft, and the Diamondbacks baseball team highlights how the focus of a discussion in terms of hashtags, retweets, and shared hyperlinks may vary significantly on a daily basis.

4.2 Connected Components

As a discussion network grows over time through the addition of new tweets, not all tweets are expected to join the same connected component. For example, two news stories may give rise to disjoint sets of users conversing about each topic, resulting in two large disconnected components. However, our analysis revealed that all of our networks exhibited the same properties: one large connected component, and many small components. The large component was always at least an order of magnitude larger than any of the smaller components, and most of the small components were “singleton” tweets: they consisted of one tweet, one user, and zero to three of the other node types. These small components represent locations, people, and what we might call peripheral topics, not connected in any way to the main theme of the discussion. This behavior resembles the observed component distribution of online social networks, which tend to have a giant component, and many smaller disconnected components (Kumar, Novak, and Tomkins 2010). However, the behavior is surprising given that our network is tweet-based with the majority of connections between tweets coming from intermediary edges from locations, hashtags and hyperlinks, not directly from tweets.

Ruiz et al. (2012) show that the number of connected components tends to have very high correlation with changes in traded volume for a stock. However, given our experimental results, the correlation may be due simply to the fact that more people are discussing the keyword which the graph is built around, and reflects tweet volume rather than relations among users, hashtags, and hyperlinks. For example, given a spike in public interest about a topic (e.g., a baseball game), more users will tweet messages containing the keyword. However, the increased volume creates more variety in the observed hashtags, hyperlinks, and locations, and therefore creates more nodes that are not connected to the larger component by virtue of their uniqueness. Indeed, Ruiz et al. (2012) note that the second most predictive feature was the number of nodes, which can be computed by keeping a simple count of node entities, without the need for a graph. In our case studies, for all of a keyword’s graphs, Spearman’s ρ for the number of nodes and the number of components was above 0.9 for nearly every keyword, thereby suggesting that node count alone could be used as an effective substitute. We speculate that the correlation in tweet frequency with increased trading volume may be another example of an already established link between online user behavior with search engines and trading volume (Preis, Reith, and Stanley 2010).

4.3 Diameter

The diameter of a connected component measures the length of the longest path between two vertices in that component, a feature which is not accessible using a frequency-based analysis. We hypothesized that if a conversation is highly concentrated around a few entities, then it is likely that the diameter of the largest component will be small relative to the component size. Conversely, if a discussion is unfocused with many unrelated entities, then the diameter of the largest component will increase. Additionally, because the total volume of discussion entities varies per day, we use a version of the diameter that is normalized by the total graph size. We use the term “normalized diameter” to refer to the value

\[ \frac{\max_{u,v} d(u,v)}{n_{L(G)}}, \]

where $\max_{u,v} d(u,v)$ is the value of the diameter in the largest connected component, and $n_{L(G)}$ is the number of vertices in the largest connected component of graph $G$.

Figure 4 shows a comparison of the diameter and normalized diameter over time for the keywords “Microsoft”, “Yankees”, and “Great Britain”. We did not observe any changes in the diameter that corresponded to changes in user behavior. However, the normalized diameter does show clear patterns that correspond to real-world phenomena affecting user behavior. Despite being very different keywords, all three show changes in the normalized diameter that correspond to changes in how concentrated the discussion is. The spikes in normalized diameter in Fig. 4a correspond to weekends during which less news is published about Microsoft as a company. Similarly, the spikes in Fig. 4b correspond to days when the Yankees do not have games and discussion ranges widely. In contrast, the sharp drop in Fig. 4c corresponds to the start of the Olympic Games, where a high percentage of the discussion for Great Britain becomes highly focused around that topic.

[Figure 4 panels: (a) Microsoft, (b) Yankees, (c) Great Britain. x-axis: Date in July, 2012; left y-axis: Diameter; right y-axis: Diameter / Number of Nodes; series: Diameter, Normalized Diameter.]
Figure 4: Spikes in the normalized diameter show a decentralization of the discussion around a topic that corresponds to real-world phenomena.

[Figure 5 axes: Degree (x, log scale) versus Frequency (y, log scale); series: Microsoft, Mets, Great Britain, Jay-Z.]
Figure 5: A log-log plot of degree frequency for all nodes in the largest component during one day’s discussion.
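The component count of Section 4.2 and the normalized diameter of Section 4.3 can be illustrated with a short, standard-library-only sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`build_adjacency`, `bfs_distances`, `connected_components`, `normalized_diameter`) and the toy edge lists are hypothetical, and the discussion graph is assumed to be undirected and unweighted.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def connected_components(adj):
    """Return the node sets of all connected components."""
    seen, components = set(), []
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            components.append(component)
    return components


def normalized_diameter(adj):
    """Diameter of the largest component divided by its node count."""
    largest = max(connected_components(adj), key=len)
    # Exact diameter: the largest BFS eccentricity over the component.
    diameter = max(max(bfs_distances(adj, n).values()) for n in largest)
    return diameter / len(largest)


# Hypothetical toy discussions (node labels are made up for illustration).
# Focused: four tweets share one hashtag, plus one unrelated tweet/user pair.
focused = build_adjacency(
    [("t1", "#yankees"), ("t2", "#yankees"), ("t3", "#yankees"),
     ("t4", "#yankees"), ("t9", "u9")]
)
# Unfocused: tweets chained together through different hashtags.
unfocused = build_adjacency(
    [("t1", "#a"), ("#a", "t2"), ("t2", "#b"), ("#b", "t3")]
)

print(len(connected_components(focused)))  # 2 components
print(normalized_diameter(focused))        # 2 / 5 = 0.4
print(normalized_diameter(unfocused))      # 4 / 5 = 0.8
```

Both toy graphs have five nodes in their largest component, yet the hub-shaped discussion yields a smaller value than the chain, which matches the intuition that a concentrated discussion has a small diameter relative to its size. Running a breadth-first search from every node costs on the order of |V| times |E| operations, so a daily graph built from up to 15,000 tweets would probably call for an optimized graph library or an approximate diameter.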
5 Graph Node Metrics

In comparison to the complete-graph metrics in Section 4, graph node metrics track the properties of individual nodes, which may change significantly over time. For example, while the normalized diameter may indicate whether a conversation is highly concentrated around a certain topic, it does not tell us which topic. In order to evaluate more local changes in network structure, we studied changes in distributions for three measures: degree and PageRank, used by Ruiz et al. (2012), and a new measure, closeness.

5.1 Degree Distribution

Newman, Watts, and Strogatz (2002) state that a highly skewed degree distribution which obeys a power law is one of the three distinctive features of social networks. We expected locally-focused discussions to have the highly skewed distribution characteristic of social networks, because a topic that is specific to a small geographic region may include a substantial number of users who are socially connected. Figure 5 shows a log-log plot of node degree frequencies on a single day for four representative discussion keywords, all of which resemble a power law, consistent with our hypothesis. We note that the networks could have non-power law distributions if generated by certain types of user behaviors. For example, we originally hypothesized that company terms would generate multiple sub-discussions around their various products, which would yield a more uniform degree distribution. However, the degree distributions were indistinguishable for different discussion types. For each keyword, no significant temporal changes were observed in the degree distribution’s mean, median, variance, kurtosis, or skewness.

Ruiz et al. (2012) also found that the degree distribution was not a reliable source of information for graphs representing interactions between actors and topics related to the stock market. The authors attempted to track changes in the properties of degree distributions in their networks, but found that these properties did not correlate well with any of the stock market events which they tried to predict. Their approach considers statistics such as the quartiles, skewness, and kurtosis of the degree distributions. Their finding is consistent with our observation that the degree distributions are relatively stationary over time and therefore the statistics would not yield additional predictive power.

5.2 PageRank

The PageRank of a node measures how well-connected the node is in terms of having important neighbors (Brin and Page 1998). Thus, for graphs representing a topical discussion, we would expect nodes with a large PageRank to be the most important aspects of the discussion. Furthermore, analyzing the changes of the highest ranked nodes over time can reveal what drives the discussion. We illustrate this point through Figure 6, which visualizes the PageRank scores for the most frequent locations, hashtags and users in the respective keyword-graphs for three types of discussions over the course of a month.

[Figure 6 panels: (a) IBM, (b) Lady Gaga, (c) Yankees. x-axis: Date in July, 2012; y-axis: PageRank; one series per node.]
Figure 6: A comparison of the PageRank values for IBM, Lady Gaga, and Yankees for the ten most frequent nodes of types hashtag (top), location (middle), and user (bottom). The temporal changes in the PageRanks of the most frequent nodes reveal distinct types of discussion behaviors that are representative of trends seen for similar terms.

For hashtags, we see that whereas discussions of IBM and Yankees are driven by a relatively stable set of hashtags, the discussions for Lady Gaga are frequently associated with hashtags such as #RetweetThisSong that quickly spike and die down, as well as tags such as #LastFM, which indicate that a song by Lady Gaga has been played. The difference in tagging behavior between topics agrees with a recent analysis by Romero, Meeder, and Kleinberg (2011).

Key differences are also seen for locations in Figure 6, where the discussions for “IBM” and “Lady Gaga” have highly populous cities in several countries as the highest weighted in their graph, with no clear trend in which city generates the most discussion. Conversely, the discussions for “Yankees” consistently have the team’s home city, New York, as the most central. Other highly ranked locations in the network correspond to other cities with baseball teams, with the locations spiking in PageRank when the Yankees play a team from that city.

The ranking of users reveals that a single user is consistently ranked the highest in the graph, which further analysis showed to be Twitter accounts associated with the company, entertainer, and sports teams, respectively. In addition, both Lady Gaga and the Yankees showed cases where other nonassociated users were central to the location.
In the case of the Yankees, these users tended to be accounts associated with the teams they were playing and therefore correlated with the trends seen in location. We would expect the emergence of a single prominent user in the discussion network if that user was highly influential, a driving force behind the discussion. However, we found that the most central user according to the PageRank measure was merely the user who tweeted most frequently, and was represented by the corresponding user node with the highest degree.

Last, as a test of the importance of the PageRank score, we computed Spearman’s ρ between the relative ranking of a node according to PageRank and to its degree. For every graph in our pilot study we observed that ρ was above 0.9, when comparing the nodes in the largest component. Given the high correlation, it appears that although the PageRank provides a useful measure of importance for an entity in a discussion network, a simple frequency-based approach to measuring importance (the degree of a node in our network is simply a count of occurrences of the entity represented by the node in the Twitter dataset) can provide highly similar ratings with far lower computational costs.

5.3 Closeness

Closeness centrality is a different measure of node importance than degree, PageRank, or betweenness centrality, measuring which nodes are near the graph’s “center”. Formally, the closeness $Cl(v)$ of a node $v \in V$ is defined as the inverse of the mean of the farness to other nodes:

\[ Cl(v) = \frac{1}{\frac{1}{|V|} \sum_{u \in V} d_G(u, v)}, \]

where $d_G(u, v)$ is the length of the shortest path between $u$ and $v$, and $|V|$ the number of vertices, in the graph $G$. We only consider the largest component of the graph when computing closeness centrality. That is, in our case $G$ represents the largest connected component of the Twitter network, and we need not deal with cases when $v$ is not reachable from $u$. If we rank all the nodes in the largest component of our graph by closeness, the highest-ranked nodes are closest in path length to all the other nodes in the graph. Thus, if we believe that an online discussion is centralized around one or very few topics, people, or locations, we might expect the nodes representing these focal points to have the best closeness scores. Examining node importance in this way might be best when considering compact, highly focused local discussions.

Unlike PageRank and betweenness, closeness did not appear to correlate highly with degree; Spearman’s ρ between closeness and node degree was less than 0.4 in every case. Some correlation is still expected due to the network construction process: a node with high degree will be connected to other nodes in the graph with greater probability, increasing the chance that it has short path lengths to all other vertices. Unlike the degree and PageRank distributions, however, the distribution of node closeness values resembles a normal (Gaussian) shape, as shown in the distribution in Figure 7 of all nodes in the largest component of the “Cardinals” graph. In addition, the shape of our node closeness distributions was not as consistent as that of the degree and PageRank distributions. We observed a variety of distributions, ranging from nearly uniform to the normal shape discussed above, depending on the day and keyword being represented.

[Figure 7 axes: Closeness (x) versus Frequency (y).]
Figure 7: Distribution of node closeness values in the “Cardinals” graph on July 6, 2012.

[Figure 8 axes: Date in July, 2012 (x) versus closeness (y); series: #CardinalNation, #TakeJake, #Cubs, #stlcards, #MLB, #ASG, #BestTeam, #FinalVote, #FreesePlease, #Cardinals.]
Figure 8: Closeness values of the top ten hashtags appearing in the “Cardinals” network for the month of July.

Closeness appears to be a good indicator of tiers of importance for entities of a given type. In Figure 8, #Cardinals is consistently a top-ranked hashtag, #MLB, #stlcards, and #CardinalNation are of secondary importance, and the more trendy hashtags, such as #TakeJake, only spike in importance around the dates that they are relevant to the discussion. Levels of importance also appear when we consider other entities, such as locations or users, and for other keywords as well. Ranking hashtags by PageRank provides a similar ordering (Fig. 9); however, the PageRank values do not vary sufficiently to differentiate between levels of importance. Closeness values should further enable discussion analysis by highlighting which entities are of highest importance.

[Figure 9 axes: Date in July, 2012 (x) versus PageRank (y); series: #CardinalNation, #TakeJake, #Cubs, #stlcards, #MLB, #ASG, #BestTeam, #FinalVote, #FreesePlease, #Cardinals.]
Figure 9: PageRank values of the top ten hashtags appearing in the “Cardinals” network for the month of July.

6 Related Work

Several works have leveraged the structure of microblog discussions to discover important features, such as hyperlinks (Shamma, Kennedy, and Churchill 2010), hashtags (Romero, Meeder, and Kleinberg 2011), events (Lee, Wakamiya, and Sumiya 2011), or tweets themselves (De Choudhury, Counts, and Czerwinski 2011). Romero, Meeder, and Kleinberg (2011) analyze the growth and persistence of hashtags in different topic categories, demonstrating that the emergence of a popular hashtag is highly topic dependent. Lee, Wakamiya, and Sumiya (2011) leverage geocoded microtext to discover regional events as well as to categorize events by geographic regions. In contrast, our analysis may incorporate those discussions that are not regional; however, the identification of region-specific discussions may enable additionally discovering important discussion elements, such as events, that are also region-specific.

In identifying the most important tweets in a discussion, De Choudhury, Counts, and Czerwinski (2011) consider the related problem of which tweets are the most important to return for a search query on a term. Using microtext, social network, and discussion attributes that were selected based on a user survey, they found that tweets which exemplified diversity in these attributes were among the best to return according to a user assessment. Future work may consider whether a diversity ranking with respect to discussion elements may enable better identification of the key tweets in an evolving online discussion.

Recent work has also focused on the related problem of identifying location-specific topics (Mei et al. 2006; Wang et al. 2007; Hong et al. 2012). Whereas our analysis asks whether location is an important factor in an online discussion, these models identify the most likely language of interest for specific locations by leveraging microtext that has been geocoded. Unlike our analysis, these models do not emphasize the change in topics for a location over time. Both Hao et al. (2010) and Yin et al. (2011) address the problem of comparing regional topics, which is related to our question of how to compare discussions. Future analyses may integrate these comparative approaches by examining how typical the microtext produced in a location-driven discussion is to that region.

Last, we note that in a study of viewing trends for YouTube, Crane and Sornette (2008) propose a model of online viewing that identifies three classes of user behavior. Unlike our analysis, they model only viewing counts, whereas we consider factors such as location and repeated participants.

7 Conclusion and Future Work

We presented an initial case study on using a network model and associated statistics to analyze and characterize online networks. Motivated by the work of Ruiz et al. (2012), we generated a network model that incorporated nodes for five discussion features: tweets, users, hyperlinks, hashtags, and locations, with links between nodes based on the content of a tweet. Using these discussion networks, we analyzed four network statistics that were found to be highly predictive of stock market trends: node type distribution, the number of connected components, PageRank, and node degree. In addition we proposed two new features, the normalized diameter and the nodes’ closeness distribution, both of which offered superior insight into the discussions. As a result of our analysis, we also found that the two highly predictive properties used by Ruiz et al. (2012), degree and PageRank, are easily reducible to simple frequency counts. Furthermore, a secondary analysis of the distribution of node degree and PageRank revealed that these distributions are stationary over time, and therefore aid neither in discussion analysis nor in prediction.

We believe that modeling online discussions as relationships between locations, people, and topics has potential to reveal interesting properties of online discourse; but in order to better exploit the interdependence of ties between these entities, a structural analysis that moves beyond basic graph metrics is needed. For example, higher-order network features such as motifs (Milo et al. 2002) may provide evidence of repeated connectivity patterns to better discriminate between discussion types. Furthermore, different methodologies for modeling discussions as networks, e.g., linking users together, may provide graphs that are more amenable to discussion analysis.

Our immediate future work will focus on two areas. First, we plan to use the presented network features to classify the discussions of novel terms and discover discussions that change type over time.
For example, we might expect to observe a stabilization of the discussion involving collective organization where participants begin to use a set of common hashtags and are active in similar locations. Second, we plan to explore the use of the discussion network itself as a forensic tool for understanding the key elements in a discussion over time. Although the frequency of a hashtag or location may suggest the most important feature at a given moment, the type of discussion, once classified, can reveal which types of entities will generate sustained discussion in days to come.

We see the potential application of our microblog analysis toward the geographic tracking of public interest in a topic. For example, the keyword “influenza” may appear to characterize a local discussion geographically centralized in a small area, but the most central location to the topic may change over time. We would thus compare the apparent interest in influenza with the spread of the illness, and look for correlation between immunizations and the importance of the influenza discussion topic to a geographic region. We are also interested in associating geographic patterns in topical discussions with social and political movements, which can help answer questions on how widespread and connected a movement is.

Acknowledgements
Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC00285. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References
Boyd, D.; Golder, S.; and Lotan, G. 2010. Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In System Sciences (HICSS), 2010 43rd Hawaii International Conference on, 1–10. IEEE.
Brin, S., and Page, L. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks 107–117.
Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM international conference on Information and knowledge management, 759–768. ACM.
Crane, R., and Sornette, D. 2008. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences 105(41):15649–15653.
De Choudhury, M.; Counts, S.; and Czerwinski, M. 2011. Identifying relevant social media content: leveraging information diversity and user cognition. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia, 161–170. ACM.
Hao, Q.; Cai, R.; Wang, C.; Xiao, R.; Yang, J.; Pang, Y.; and Zhang, L. 2010. Equip tourists with knowledge mined from travelogues. In Proceedings of the 19th international conference on World Wide Web, 401–410. ACM.
Honey, C., and Herring, S. 2009. Beyond microblogging: Conversation and collaboration via twitter. In System Sciences, 2009. HICSS’09. 42nd Hawaii International Conference on, 1–10. IEEE.
Hong, L.; Ahmed, A.; Gurumurthy, S.; Smola, A.; and Tsioutsiouliklis, K. 2012. Discovering geographical topics in the twitter stream. In Proceedings of the 21st international conference on World Wide Web, 769–778. ACM.
Huang, J.; Thornton, K.; and Efthimiadis, E. 2010. Conversational tagging in twitter. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, 173–178. ACM.
Kumar, R.; Novak, J.; and Tomkins, A. 2010. Structure and evolution of online social networks. Link Mining: Models, Algorithms, and Applications 337–357.
Lee, R.; Wakamiya, S.; and Sumiya, K. 2011. Discovery of unusual regional social activities using geo-tagged microblogs. World Wide Web 14(4):321–349.
Mei, Q.; Liu, C.; Su, H.; and Zhai, C. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of the 15th international conference on World Wide Web, 533–542. ACM.
Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.; and Alon, U. 2002. Network motifs: simple building blocks of complex networks. Science 298(5594):824.
Newman, M.; Watts, D.; and Strogatz, S. 2002. Random graph models of social networks. Proceedings of the National Academy of Sciences of the United States of America 99:2566–2572.
Preis, T.; Reith, D.; and Stanley, H. 2010. Complex dynamics of our economic life on different scales: insights from search engine query data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368(1933):5707–5719.
Romero, D.; Meeder, B.; and Kleinberg, J. 2011. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In Proceedings of the 20th international conference on World Wide Web, 695–704. ACM.
Ruiz, E.; Hristidis, V.; Castillo, C.; Gionis, A.; and Jaimes, A. 2012. Correlating financial time series with micro-blogging activity. In WSDM’12.
Shamma, D.; Kennedy, L.; and Churchill, E. 2010. Tweetgeist: Can the twitter timeline reveal the structure of broadcast events? In CSCW 2010.
Wang, C.; Wang, J.; Xie, X.; and Ma, W. 2007. Mining geographic knowledge using location aware topic model. In Proceedings of the 4th ACM workshop on Geographical information retrieval, 65–70. ACM.
Yin, Z.; Cao, L.; Han, J.; Zhai, C.; and Huang, T. 2011. Geographical topic discovery and comparison. In Proceedings of the 20th international conference on World Wide Web, 247–256. ACM.
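As a supplementary illustration of the closeness measure defined in the Closeness discussion, the following minimal Python sketch computes Cl(v) as |V| divided by the sum of shortest-path distances, restricted to the largest connected component. It is not the authors' implementation: the adjacency-list representation, the function names, and the toy graph (one hashtag, three tweets, one user, and a small disconnected pair) are all assumptions made for illustration only.

```python
from collections import deque


def largest_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in comp:
                    comp.add(nbr)
                    queue.append(nbr)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def bfs_distances(adj, source):
    """Unweighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness(adj):
    """Cl(v) = |V| / sum_u d(u, v), computed on the largest component only."""
    nodes = largest_component(adj)
    n = len(nodes)
    scores = {}
    for v in nodes:
        total = sum(bfs_distances(adj, v).values())
        scores[v] = n / total if total > 0 else 0.0
    return scores


# Hypothetical toy discussion network (undirected adjacency lists).
toy = {
    "#Cardinals": ["t1", "t2", "t3"],
    "t1": ["#Cardinals", "u1"],
    "t2": ["#Cardinals"],
    "t3": ["#Cardinals"],
    "u1": ["t1"],
    "x": ["y"],  # small component, ignored by closeness()
    "y": ["x"],
}
scores = closeness(toy)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

In this toy graph the hub hashtag receives the highest closeness, which mirrors the intuition that focal points of a concentrated discussion sit near the graph's center. The sum in the denominator includes the zero distance from v to itself, matching the sum over all u ∈ V in the definition.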