Analysis Of The Diffusion Of Innovations In Social Networks

Master Thesis by Lucas Brönnimann
Brugg-Windisch, FHNW, 07/02/2014

Note: The original manuscript was in German; it was translated with http://translate.google.com/#de/en/ and edited by Ken Riopelle.

Social Network Analysis

ABSTRACT

In social networks, new ideas and thoughts can spread quickly. A diffusion analysis identifies which topics propagate through the network, the speed with which this happens, and the people who are particularly successful at communicating their ideas. This work is particularly concerned with the last point and seeks to find the influential people who efficiently pass on their behavior, terms and concepts to their immediate communication network. Using multilingual text analysis methods, a diverse set of communication networks is investigated and evaluated with a proposed set of content measures derived from the language and behavior of the people who communicate with each other. The goal is to detect the most important or influential actors. The term "influence" is defined as the effective impact of a person on others within an observed communication network. To validate the content-derived Influence measure, four different communication network data sets are evaluated, which include e-mail communications of individual actors, project teams and Twitter sub-networks. The results indicate that the operationalized definition of Influence derived from text analysis is likely to find important people within a communication network, even in situations in which conventional methods have proved unable to do so.

Contents

Abstract
1. Introduction
1.1. About this document
1.2. Background
1.3. Objective
1.4. Software Development
2. Analysis of Communication
2.1. Structure of communication networks
2.1.1. Twitter
2.1.2. E-mail networks
2.1.3. Additional Networks
2.2. Network Analysis
2.2.1. Centrality
2.2.2. Other metrics
2.3. Text Analysis
2.3.1. Metrics for the analysis of text
2.3.2. Advanced Text Analysis
2.4. Influence and Innovation
2.4.1. Definition of influence
2.4.2. Changes in a person’s word histogram
2.4.3. Influence of individual messages
2.4.4. Influence - combination of: similarity, relevance and time delay
2.4.5. Example
2.4.6. Tracing Innovation through text over time
3. Implementation and Evaluation
3.1. Implementation in Condor
3.1.1. Functionality of Condor
3.1.2. Tools for Analysis of the network
3.1.3. Visualization of a Network
3.1.4. Detailed nodes and edges information
3.1.5. Diagrams
3.2. Test data: overview and treatment
3.2.1. Twitter: Politics
3.2.2. Twitter: BMW
3.2.3. E-mail: COIN Seminar
3.2.4. E-mail: Private E-mail accounts
3.3. Evaluations
3.3.1. Twitter: Politics
3.3.2. Twitter: BMW
3.3.3. E-mail: COIN Seminar
3.3.4. E-mail: Private E-mail accounts
3.4. Discussion
3.4.1. Properties of influential actors
3.4.2. Properties of influential messages
3.5. Influence advantages and other applications
4. Conclusion
5. Recommendations for future work
6. Bibliography
7. List of figures
8. Glossary
9. Honesty statement
10. Appendix A
11. Appendix B

1. Introduction

1.1. About this document

This document describes several methods for the analysis of communication in social networks. The first part defines network terms and lays out the theoretical basis for the analysis of networks. Based on this, a system is described which examines the diffusion of new ideas, concepts and innovations for the purpose of identifying the relevant or most influential actors in this process.
Besides the theoretical description of the system's development and function, a number of enhancements to the software "Condor" are proposed and tested in the analysis of networks. The second part of the document includes four test cases based on e-mail and Twitter communication datasets, which are used to demonstrate and evaluate the accuracy and utility of the text-derived Influence measure. Insights from the development and application of these test cases are documented and discussed in more detail. The conclusion offers suggestions for future work.

1.2. Background

The analysis of social networks is used to identify current trends, to find important people in the network and to observe the distribution of information. This is also known as cool hunting. [1] Previous studies are often based largely on the structure of the corresponding network and use only well-known network algorithms to find central persons, for example betweenness centrality [2] or Google's PageRank [3]. Other works deal with the analysis of the communication itself and use different techniques from Natural Language Processing. For example, the content of messages can be studied with the help of Sentiment Analysis to assess how positively or negatively actors communicate in a network. Here, the structure of the network receives little or no consideration.

Both of these approaches are implemented in the current version 3.0.a8 of the Condor software. [4] This software consists of a number of modules for retrieving data from multiple sources and for the subsequent analysis and visualization of these networks. However, its text analysis is limited to the analysis of sentiment and the finding of important keywords. Consequently, the software cannot determine how topics spread in networks or identify the key influential people.

1.3. Objective

The flow of information among actors in social networks is to be examined with the aid of text analysis methods.
To this end, appropriate metrics are defined which describe the flow of information and the influence that an individual actor exerts on these measures. In particular, the metric for measuring the influence of an actor within its network is the primary objective of this study and will be reviewed for its quality and validity.

In order to measure the influence that the language of the actors has on others, it is important to analyze that language over time. Several methods of text analysis will be used to determine whether a message from A to B has an influence on the future messages sent by B. In addition, other text metrics will be examined, such as the average complexity of the vocabulary used and the sentiment of the text, that is, its positivity or negativity. The test data are networks collected from Twitter, as well as e-mail networks of single individuals and project teams.

Furthermore, the factors that affect the new influence metric are discussed along with its inherent limitations. This can be used to determine which types of message have a high chance of influencing other actors in the network.

1.4. Software Development

The Galaxy Advisor’s software, Condor, is used to analyze social networks. Additional functionality was added that calculates, among other things, the influence of actors on their surrounding network and visualizes it. Specifically, a text content processing module (see Section 3.1.2, Tools for Analysis of the network) was developed for this purpose and incorporated into the software. These enhancements are documented in accordance with the programming policies of Condor and checked by means of JUnit tests. All of the new features can be used productively in the current version of Condor 3.

Table 1 shows the values that Condor can calculate through analysis of the nodes and edges in a communication network. The four data elements written in red italics have been added by this work.
Nodes                     Edges (with text)
Degree Centrality         Text length
Closeness Centrality      Sentiment
Betweenness Centrality    Emotionality
Contribution Index        Response time
Influence

Table 1 Available metrics in communication networks

2. Analysis of Communication

This chapter focuses on the theoretical aspect of the work. It consists of an introduction to the structure and definitions of communication networks and describes conventional and new possibilities for textual content analysis. This is followed by an in-depth description of a method for measuring the influence of individual actors and for tracing their impact on the diffusion of innovations in communication networks over time. The listed methods of analysis for actors and messages are those that are implemented in Condor 3 or were implemented during the course of this work.

2.1. Structure of communication networks

In order to observe the diffusion of innovation and to measure the influence of individuals within the network, some network structures are clearly preferable to others. An ideal data network meets the following conditions:

• Access to all data in the network is available
• All external factors are known and controllable
• All relevant communication between the actors in the observable network is available
• No irrelevant or useless data (spam) is included

While such ideal networks do not exist in reality, there are several that come relatively close. In particular, e-mail networks in academic and business environments are worth considering for this purpose. Ideally, a study would have complete e-mail access to one or more project teams over a long period of time. Of course, this kind of e-mail data collection requires the consent of all parties, which is not always practical. Another very suitable network for analysis is Twitter, because the data is very easily accessible.
This includes messages addressed directly to other users, which are simply marked with an "@" followed by the Twitter name and can be viewed by anyone. Facebook is a possible data source, but it is less desirable because of its restrictive privacy settings, which prevent much of the communication between friends from being included. Thus, Facebook is not used as a data source in this study. Likewise, other networks suffer from the problem that the data are difficult to access and often reflect only a portion or small sample of the relevant communication. Nevertheless, Condor’s "Web Fetcher" does make it possible to collect data from smaller communities if they are made public on websites, blogs or Wikipedia articles.

2.1.1. Twitter

Twitter is one of the most easily observable networks because of its API (Application Programming Interface). A Twitter network can be observed in practice when it is bounded for analysis by a search term or by a follower network. It is impractical to analyze all Twitter activity at once.

Twitter networks can be represented in different ways. In general, a graph is created in which all observed Twitter accounts are mapped to nodes and the tweets are represented as edges. This is the case with Condor 3; the software is described in greater detail in Section 3.1.1. A Twitter search on a term results in a set of tweets. Their authors form the nodes, and the edges between people are obtained from so-called @Mentions, symbolized by an "@" sign followed by the desired user name within the tweet. Retweets are also mapped, by creating an edge between the author of the original tweet and the person who retweeted it.

Depending on the Condor Twitter Fetcher options selected, the structure of the resulting network visualization differs. Figure 1 shows a network in which the author of each tweet is linked to the search term, which yields a star-shaped structure with the search term as the node in the center.
Figure 1 Illustration of a Twitter network, where all tweets are connected in the center to the search term.

In contrast, Figure 2 shows two visualizations in which the tweets are not connected to the search term. The visualization on the right shows the same network, but without the non-affiliated nodes. It is apparent that these visualizations have a very loose network structure without much communication among the users.

Figure 2 Twitter visualizations where the tweets are not connected to the search term.

As an alternative, if a person’s tweets are collected over time, including both retweets and tweets with an @Mention of others, this creates a denser network that is more amenable to the kind of text analysis of interest in this study. Figure 3 shows an example of such a network, for which a random selection of Swiss politicians and their last 100 tweets was collected. This denser network structure is much more useful and better suited to the purpose of this work, which is to find important people within a network. The ability to create such a network is currently not available in Condor’s production version, but it requires only a relatively small number of code changes.

Figure 3 A random selection of Swiss politicians and their last 100 tweets forming a denser network

2.1.2. E-mail networks

E-mail networks are ubiquitous, but appropriate access is often difficult to obtain, except in laboratory situations. Nevertheless, there are ways to obtain e-mail data for analysis. For example, it is possible to analyze one's own mailbox. Figure 4 shows a network created by the analysis of the single mailbox of a student. Each node represents an e-mail address, and each e-mail sent creates an edge to every recipient, even one who was only in the CC.
Note: persons with multiple e-mail addresses can occur several times in different places in the network, unless their addresses are merged into a single e-mail address.

Figure 4 The e-mail network of a single person can have very different structures.

In cases where the appropriate privacy agreements are in place, it is possible to analyze the e-mail networks of project teams. An example is shown in Figure 5, which represents the e-mail communication of 11 project teams with an average of 5 people per team. In this case, a dummy e-mail address was created for each team, and team members copied all their e-mail to this dummy address during the course of their project.

Figure 5 Example e-mail network of 11 project teams

A weakness of this data collection process is the very real possibility that a large amount of communication among team members does not occur within the e-mail network, because members use other channels of communication, such as face-to-face meetings, conference calls or text messages.

Despite these limitations, e-mail networks have the potential to give very deep insight into the communication of the respective employees, because people have often saved an e-mail archive for a project. E-mail archives make it possible to observe longer periods of time, to visualize the development of collaborations between people and teams, and to trace new or important topics by measuring their speed and diffusion throughout the network.

2.1.3. Additional Networks

Blogs and Wikipedia articles are also possible data sources for exploring the development of new issues. In such a network, the nodes correspond to the web page of the blog or the Wikipedia article. The edges are links to the respective other websites. The problem with blogs and Wikipedia articles is the difficulty of determining where the information originated and exactly how the information flow behaves.
Thus, blogs and Wikipedia data are good sources for the discovery of important issues and can be analyzed with sentiment analysis, but they are inherently problematic for assessing the dissemination of information over time.

2.2. Network Analysis

Network algorithms enable the calculation of various metrics, such as network centrality measures, which can be applied to all the actors. It is not necessary to examine the content of the communication for these calculations. In this section, the following network structure measures are described in more detail:

• Degree Centrality
• Closeness Centrality
• Betweenness Centrality
• Activity
• Contribution Index

2.2.1. Centrality

For an actor, or node, of a network, different measures of centrality can be calculated. Important ones include Degree Centrality, Betweenness Centrality and Closeness Centrality, each of which assigns a centrality value to every node v of a graph with node set V.

Degree Centrality

This very simple measure is defined as the number of edges connected to a node. It describes the number of actors that have a direct connection with a node. [5]

Closeness Centrality

The Closeness of a node is calculated from the sum of its distances d to all other nodes. The Closeness Centrality of a node is defined as the inverse of this sum. Very central nodes thus have a higher closeness centrality value and are often connected to other nodes with similarly high values. [6]

Betweenness Centrality

The calculation of betweenness centrality is based on the shortest paths between all the nodes of the network. The frequency with which a node lies on the shortest path between two other nodes is the decisive factor. [2] The calculation is performed according to the following steps:

1. For each pair of nodes (s, t), calculate the shortest paths between them
2. For each pair of nodes (s, t), calculate the fraction of these shortest paths on which the node v is found
3. Sum this fraction over all pairs of nodes (s, t)

This can be represented as a formula:

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths from node $s$ to $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.

Betweenness Centrality is especially useful in analyzing social networks, because the nodes with a high value are often the actors that communicate across project teams. Thus, actors with a high Betweenness Centrality value play a key role in the dissemination of information. Figure 6 shows an example of a network in which the color of a node indicates its betweenness centrality. The dark blue nodes in the center have the highest betweenness centrality value, and the red nodes on the outside have the lowest.

Figure 6 A node’s Betweenness Centrality: the dark blue nodes have the highest value and the red nodes the lowest. (Figure 6 created by Claudio Rocchini, 23 April 2007.)

2.2.2. Other metrics

Activity

Activity is simply the total number of messages sent by an actor. When comparing different time periods, it is reasonable to normalize this count and calculate the average number of messages per day.

Contribution Index

The Contribution Index describes the "balance" between the messages an actor sends and receives and is another measure of network structure. [7] It is calculated from the ratio between the sent and received messages according to the following formula:

$$\text{Contribution Index} = \frac{\text{messages sent} - \text{messages received}}{\text{messages sent} + \text{messages received}}$$

This produces values between -1 and +1, where -1 denotes a person who only receives messages and does not send any, and +1 denotes a person who only sends messages and does not receive any. In an ideal network, each actor sends about as many messages as they receive, but in practice this can vary greatly from person to person.

2.3. Text Analysis

In addition to metrics that focus on the actors or nodes of a network, metrics can be computed for the edges, that is, for the content of the communication that connects the nodes.
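The Activity and Contribution Index measures from Section 2.2.2 can be illustrated with a minimal Python sketch. It assumes the commonly used form (sent - received) / (sent + received), which matches the stated range of -1 to +1; the function names and the choice to return 0.0 for an actor without any messages are assumptions made for this illustration and are not taken from Condor.

```python
def contribution_index(sent: int, received: int) -> float:
    """Balance between sent and received messages, in the range [-1, +1].

    Assumed form: (sent - received) / (sent + received).
    """
    total = sent + received
    if total == 0:
        # Assumption: an actor without any messages is treated as balanced.
        return 0.0
    return (sent - received) / total


def activity_per_day(messages_sent: int, days: int) -> float:
    """Activity normalized to the average number of messages sent per day."""
    if days <= 0:
        raise ValueError("days must be positive")
    return messages_sent / days


if __name__ == "__main__":
    print(contribution_index(10, 0))   # actor who only sends
    print(contribution_index(0, 10))   # actor who only receives
    print(activity_per_day(70, 7))     # 70 messages in one week
```

Normalizing Activity per day keeps values comparable when the observed time periods have different lengths.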
For a text, or a parsed edge, the following measures can be computed:

• Number of recipients
• Length
• Sentiment
• Emotionality
• Complexity
• Response time
• Influence

For each actor or node, the average value of these edge data can be determined and, in certain cases, the sum of the values.

2.3.1. Metrics for the analysis of text

For a single text, some values are relatively simple to calculate, while others are more difficult. Of interest in this study are the following text metrics: length, sentiment, emotionality and complexity. Note: for performance reasons, it makes sense to calculate a measure only once for a message that has multiple recipients and to copy the result to the other recipients.

Number of recipients

The number of recipients is not a metric of the text itself, but it may be crucial for various comparisons. It is quite possible that some people do not use the same communication style when they talk privately with an individual as when they talk to several people at once.

Length

Text length is the number of characters in a message. On Twitter, there is a maximum of 140 characters.

Sentiment

Sentiment is a value between 0 and 1, where a high value indicates a positive sentiment and a low value a negative one. It is calculated by a specially developed multilingual classifier based on a machine learning method trained with data from Twitter. The analysis makes it possible to classify texts in German, English, French, Spanish and Italian, and the classifier was trained using a total of about 200 million tweets with negative and positive emoticons. The exact way it works is detailed in the paper "Multi - Language sentiment analysis of Twitter data on the example of Swiss politicians." [8] The principle is based on an idea by Patrick de Boer and on other research studies. [9] [10]

Emotionality

Because sentiment is calculated as the average over the whole text, information can sometimes be lost.
For example, the following messages are both classified as approximately neutral:

"I bought a smartphone today."
"I love my new phone sooo much! The old one was terrible."

In the second text, the positive and negative statements are roughly in balance, but the text is much more emotional. Such texts can be distinguished from one another by the value of emotionality. The Emotionality value is calculated as follows:

Complexity

The complexity of a text can be measured in several ways. Sentence length often plays a crucial role in such calculations, but this approach creates a problem when multiple languages are involved. A solution is to use the $D_{Twitter}$ training data set that is already used for the calculation of sentiment and emotionality. This source contains information about how often individual words in the different languages were used on Twitter in combination with positive or negative emoticons. For example, the word "Internet" was used 1,182 times in a total of 1,600,000 German tweets. From this it is possible to determine roughly the probability $p(W_k)$ that a single word appears in a random tweet. The logarithm of the inverse of this probability, $\log(1 / p(W_k))$, equals the IDF (Inverse Document Frequency) value, which is very well suited to describing the rarity of terms in a corpus of documents. The higher the average of the IDF values for all words of the text to be tested, the more complex the text. For texts that contain more frequently used Twitter words, the value is correspondingly smaller. The average German tweet has a value of approximately 7.1 with a variance of 1.37.

Response time

This value is specifically intended for e-mail networks. It measures the average time between the sending of a message and the response by the receiver.
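The IDF-based complexity value from the Complexity paragraph above can be sketched in a few lines of Python. This is a minimal illustration and not the Condor implementation: the natural logarithm, the function names, the decision to skip words that are missing from the frequency table, and the count for the word "und" are all assumptions. Only the count for "Internet" (1,182 occurrences in 1,600,000 German tweets) comes from the text.

```python
import math


def idf(word_count: int, total_tweets: int) -> float:
    """IDF of one word: log(1 / p(W_k)), with p estimated as count / total."""
    probability = word_count / total_tweets
    return math.log(1.0 / probability)  # assumption: natural logarithm


def average_idf(words, counts, total_tweets):
    """Average IDF over all words of a text that appear in the frequency table."""
    values = [
        idf(counts[word], total_tweets)
        for word in words
        if counts.get(word, 0) > 0  # assumption: unknown words are skipped
    ]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    total = 1_600_000
    # "internet" uses the count from the text; "und" is a hypothetical count.
    counts = {"internet": 1_182, "und": 400_000}
    print(round(idf(counts["internet"], total), 2))
    print(round(average_idf(["internet", "und"], counts, total), 2))
```

With these inputs, the rare word "internet" receives a much larger IDF than the frequent word "und", so a text made up of rare words ends up with a larger average.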
A similar value can be calculated for "turn taking", which is the average number of messages a sender answers. Note: the "turn taking" metric was not implemented in Condor 3 at the time of this study and thus is not considered further in this report.

Influence

Influence attempts to classify the importance of a message to each actor within a network. Influence takes into account whether a receiver reacts in any form after receiving a sender’s message, or changes his or her behavior. For example, influence is registered when a recipient uses a new word for the first time just after the receipt of a message. A retweet or a forwarded message counts as a "successful", influential message for the original sender. The stronger or faster the recipients' reaction, the higher the influence value for the sender. The influence measure and its exact calculation are described in more detail in Section 2.4.

2.3.2. Advanced Text Analysis

Depending on the network data, additional steps may be necessary to clean the text data for analysis. For example, it may be necessary to remove HTML and formatting, and to remove e-mail chain messages in order to confine the analysis to the most recent message.

Remove HTML / formatting

In this work, HTML is removed with the Java HTML parser Jsoup.

Recognizing the latest message in e-mail communication

Most e-mails contain not only the latest message, but also the chain of the previous communication between the recipients. For example, an edge from Person A to Person B may contain the following text, including the previous communication:

Yes, Friday suits me fine.
On 03.05.2013, at 11:33, Person B wrote:
Do you have time on Friday for a meeting?

The text of Person A consists of only five words ("Yes, Friday suits me fine."), but the previous correspondence is saved as well.
Thus, it is first necessary to separate A’s message from B’s, along with the information as to when and by whom the original e-mail was sent. The problem is that each mail program has its own formatting conventions for separating the new message from the old one. A perfect detection rate is thus very difficult to achieve and requires correspondingly good training data. Manually written regular expressions (regex) for every conceivable variant ensure the greatest possible detection rate and can cover over 95% of cases. (A regular expression, abbreviated regex or regexp, is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, i.e. "find and replace"-like operations. See: http://en.wikipedia.org/wiki/Regular_expression)

The following are some examples in which detection is possible with this system:

• New content is here
  On 03.05.2013, at 11:33, Person B wrote:
  Original message is here
• New content is here
  On Wed, November 6, 2013 17:57:06 CET wrote Person B
  Original message is here
• New content is here
  From: Person B
  Posted:
  Original message is here
• New content is here
  From: Person B [mailto: b.person@fhnw.ch]
  Sent:
  Original message is here
• New content is here
  ---- Original Message --
  Original message is here

In these examples, regex patterns make it possible to parse the data and select only the most recent text, marked here with the phrase "New content is here", for analysis. A similar regex step can also remove typical signatures such as "sent from my iPhone" and the like, which add no value to the Influence text analysis.

Treatment of hyperlinks

Hyperlinks can be removed relatively safely with simple regular expressions. In most cases, it is advisable to remove hyperlinks from the text completely, unless it is of interest to trace the exact distribution of these links in the network.
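The two regex-based cleaning steps described above, cutting off the quoted e-mail chain and removing hyperlinks, can be sketched in Python as follows. This is an illustrative sketch only and not the patterns used in Condor: the three marker patterns are assumptions modeled on the example formats listed above, and the function names are hypothetical. A production system would need many more variants to approach the detection rate mentioned in the text.

```python
import re

# Assumed marker patterns, modeled on the example formats in the text.
QUOTE_MARKERS = [
    re.compile(r"^On\b.*\bwrote\b.*$", re.MULTILINE),
    re.compile(r"^From:\s.*$", re.MULTILINE),
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}.*$",
               re.MULTILINE | re.IGNORECASE),
]

# Assumed hyperlink pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def latest_message(body: str) -> str:
    """Return only the text that precedes the earliest quoted-chain marker."""
    cut = len(body)
    for pattern in QUOTE_MARKERS:
        match = pattern.search(body)
        if match:
            cut = min(cut, match.start())
    return body[:cut].strip()


def remove_hyperlinks(text: str) -> str:
    """Delete hyperlinks and collapse the whitespace they leave behind."""
    without_links = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_links).strip()


if __name__ == "__main__":
    mail = ("Yes, Friday suits me fine.\n"
            "On 03.05.2013, at 11:33, Person B wrote:\n"
            "Do you have time on Friday for a meeting?")
    print(latest_message(mail))
    print(remove_hyperlinks("Read this: www.xyz.ch/page.html"))
```

Cutting at the earliest marker is a simple design choice; it can misfire when a new message itself contains a line such as "From: ...", which is one reason why the patterns would have to be refined against real data.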
The following fictional Tweet serves as an example in which it is better not to take the hyperlink into consideration, at least for the calculation of sentiment and emotion:

This is the stupidest thing I've ever read! www.xyz.ch/gold-makeshappy.html

The text is fundamentally negative, as the author sneers at the statement in the linked article. The words in the hyperlink ("gold makes happy") are much more positive, and a sentiment calculation that included them would yield an approximately neutral or even positive result.

Identifying Keywords

For the calculation of the Influence value and for different visualizations, it makes sense to identify the most important words, or keywords. There are various possibilities, two of which are described as follows:

Identifying keywords with regular expressions

It is relatively simple to split the text at all non-word characters using a regex. Ideally, the regex is defined over the set of all characters that occur in the observed text, and words of three letters or fewer are discarded.⁴ Another method is to construct a "stop-word" list of words deemed unimportant, such as the articles "a", "the" and "an", which are automatically excluded from the text analysis. This results in a relatively uncluttered keyword list.

⁴ However, a separation at each \s (white space) or at any \W (non-word character) is often too imprecise and should be refined further.

Identifying keywords with OpenNLP

A slightly more elaborate method uses the Java library OpenNLP. [11] In this way it is possible to first divide the text into sentences and then to subdivide these again to identify keywords. Instead of a stop-word list, it makes sense to let OpenNLP determine the nouns in the text and use only these as the keywords.

Normalization and stemming

The collected keywords often occur in different grammatical forms and with different notation.
So that this does not cause problems, the words can be simplified in advance through normalization and stemming. In normalization, the umlauts Ä, Ö and Ü are replaced by the base vowel with an appended "e". In addition, stemming can be applied, which reduces words to their root. For example, the word "categories" can be shortened to "categori". However, a stemmer for the appropriate language needs to be used, and it should be noted that stemming does not impact visualization.

2.4. Influence and Innovation

After all the preparatory work, the crucial task can be approached: determining which messages are really important. This information is used to measure the influence of individual actors on the overall network. First, "influence" needs a more precise definition. The definition is then described mathematically and explained with an example.

2.4.1. Definition of influence

Different applications use different definitions of the term "influence". Here are a few examples. In Facebook networks it is possible, for example, to analyze the distribution of links, or even of Facebook games. A person counts as influenced when she starts to play a game after another person in her network of friends has sent the automated advertisement for that game. [12] [13] The so-called Klout Score calculates influence based on the number of followers, the frequency of retweets and some other factors. [14] Unfortunately, the exact calculation is proprietary and thus cannot be compared with other values. People can also be influenced to join a network [15], or to write about certain subjects. [16] The concept of homophily, where persons communicate with those who are similar to themselves, has also been found to exert influence. However, influential people can break this barrier, so that, for example, despite large personal differences, they follow very heterogeneous people on Twitter. [17]

Despite these various definitions of influence, each tries to measure whether a person can cause a certain behavior change in their environment. Often, this behavior is directly visible in the communication network, for example in the form of retweets, new discussion topics or changes in the structure of the network. Observable behavior changes can also be found in the language used by the people in the communication network, which changes over time. An influential person is able to introduce new ideas, beliefs, and behavior patterns, which can be summarized with the term "meme", introduced by Richard Dawkins. [18] These memes become visible in messages within the communication network. Thus, Influence is defined here as the amount of new terms, concepts, and ideas which a person has introduced into the network and which are subsequently used by other members of the network. The following sections describe how to measure a person's influence within networks of communication.

2.4.2. Changes in a person's word histogram

A first approach to calculating Influence analyzes the average language use of each actor within a predefined period. A word histogram of the most commonly used words is saved as a vector for each actor or node. As a person writes new messages, this vector changes continuously. At fixed time intervals, it is possible to check how the vector of each actor has changed and to measure the difference. Once the difference for an actor X is known, it can be compared with the vectors of all other actors from whom X previously received messages. If actor X was influenced by actor Y, for example, X may now start to use words that can be attributed to Y's previous messages. Thus, a measure can be derived for the Influence that actor Y had on actor X. This can be calculated from the change in the angle between the two vectors.
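A minimal sketch may make the histogram approach more concrete. It is written in Python purely for illustration and is not the thesis implementation: the stop-word list is a tiny hypothetical sample, the umlaut replacement follows the normalization described in Section 2.3.2, and the function names histogram, angle and influence_shift are assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real list would be per language.
STOP_WORDS = {"a", "an", "the", "i", "we", "to", "of", "and", "is", "have", "has"}
# Normalization: replace each umlaut by its base vowel plus an appended "e".
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def histogram(messages):
    """Word-frequency vector of one actor, ignoring stop words."""
    counts = Counter()
    for text in messages:
        for word in re.findall(r"\w+", text.lower().translate(UMLAUTS)):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts


def angle(u, v):
    """Angle in radians between two sparse vectors (pi/2 if one is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    if norm == 0:
        return math.pi / 2
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def influence_shift(x_before, x_after, y):
    """Positive when X's word usage moved towards Y's during the interval."""
    return angle(x_before, y) - angle(x_after, y)
```

For instance, if Y wrote about a "solution" and X starts using that word in the following interval, the angle between the two histograms shrinks and influence_shift becomes positive. The sketch shares the limitation noted in the text: it cannot tell who influenced whom when both actors change at the same time.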
It should be noted that this procedure should not consider any stop words. It can also be useful to analyze the words that occur in the messages from Y to X. A major disadvantage of this method is that various random factors may distort the Influence measure. The language use of an individual is constantly changing, and often depends upon human interaction. It is particularly problematic when two people interact often, because with this procedure it is unclear who actually influences whom. This specific case was not pursued because of the large amount of random effects present, but is acknowledged as a limitation of the procedure.

2.4.3. Influence of individual messages

While the method in the previous section directly examines the people, the focus can shift to the messages themselves and their effects on the recipients. The Influence metric compares the sent message with the next messages sent by the receiver within a defined, arbitrary period r. The earlier message exchange between sender and receiver is not currently considered in this situation.

When viewing a message d from person A to person B, the following occurs: all messages f_1 to f_n sent in the next few days by the receiver of message d are analyzed and compared with d. The difference in time determines, on a linear scale z(d, f), how strongly each message is incorporated in the calculation, where r is a period of 4 days. For the message d and for each message f_1 to f_n, the TF-IDF value is now computed, where TF and IDF refer to Term Frequency and Inverse Document Frequency; the values are calculated for all terms over all sent messages in the network. For each message, the vector

v_d = [w_{1,d}, w_{2,d}, \dots, w_{N,d}]^T

is stored. This vector allows the similarity to be calculated according to the Vector Space Model [19] as the cosine of the angle between the vectors of two messages:

sim(d, f) = \frac{v_d \cdot v_f}{\lVert v_d \rVert \, \lVert v_f \rVert}

The influence E(d) of the message is now calculated as the sum of the similarities between d and the subsequent messages f_n, each weighted with the temporal scaling factor z(d, f):

E(d) = \sum_{n} z(d, f_n) \, sim(d, f_n)

The overall influence of person A is the sum of the influence values of all messages A has sent. The effort to calculate this depends on the number of messages sent by person A and on the average activity of the recipients of those messages.

The problem with this system is that the assessment cannot be verified with the persons being studied. It is entirely possible that the recipients had already been using the words in A's message for some time before A's original message appears in the data set. There is also the risk that two people who often talk to one another within the network use a very complex language or vocabulary. In this case, a few people are using a specialized vocabulary, which is very different from that of the rest of the network.

2.4.4. Influence - combination of: similarity, relevance and time delay

The system described in the last section already works very well in many cases. However, the complexity of text⁵ plays a primary role when comparing the transmitted message with the subsequent messages of the recipient. Every person has a unique use of language, which also changes over time. This is especially true of a person's name, which may occur in the greeting line of every message. A system is therefore needed that checks for the previous use of terms. The basic principle of Section 2.4.3 remains, but the system also checks whether the concepts introduced are in fact new for the recipients, or whether the recipients have already sent messages with similar content.
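As a point of reference for the refinements that follow, the basic message-level calculation of Section 2.4.3 can be sketched as below. This is a Python illustration under stated assumptions, not the Condor implementation: the exact form of the linear time factor is not given in this copy of the text, so time_factor assumes a decay from 1 at a delay of zero to 0 at r = 4 days, and tfidf_vectors uses the plain tf · log(N/df) weighting as one common TF-IDF variant. All function names are hypothetical.

```python
import math
from collections import Counter

R_DAYS = 4.0  # observation period r from the text


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse TF-IDF dict per document.

    Assumed weighting: term frequency times log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if one is all zero)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def time_factor(delay_days, r=R_DAYS):
    """Assumed linear scale: 1 for an immediate reply, 0 after r days."""
    return max(0.0, 1.0 - delay_days / r)


def message_influence(v_d, followups, r=R_DAYS):
    """Sum of time-scaled similarities between d and the receiver's later messages.

    followups: iterable of (delay_days, tfidf_vector) pairs for the messages
    the receiver sent after receiving d; messages outside the window are skipped.
    """
    return sum(
        time_factor(delay, r) * cosine(v_d, v_f)
        for delay, v_f in followups
        if 0 <= delay <= r
    )
```

The overall influence of a sender would then be the sum of message_influence over all messages that person sent.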
In addition, certain terms may not appear in the network until later, so computing the TF-IDF weights from all messages of the network is not ideal. Instead, the calculation falls back on the known Twitter data set (D_Twitter), from which the frequency of individual words is known. As mentioned in Section 2.3.1, the value log[1 / p(W_k)] is simply the IDF value within D_Twitter, and could therefore be used in place of Complexity as IDF_Twitter. The value of w_{t,d} has thus been changed and is no longer dependent on the context. Alternatively, the IDF values could be computed only from the documents around the time of d, but this may be unfavorable if little text is available or if the chosen period is one in which the analyzed term was unusually important. Ideally, each IDF term value would be calculated from the largest possible number of documents from messages of similar communication networks.

⁵ See Section 2.3.1, Metrics for the analysis of text.

The crucial change in the formula, however, is the inclusion of the relative importance of the terms in the language of the recipient. If the recipient has previously used all the terms of a message d very often, then obviously there is no influence. The factor importance(d, f) determines whether the similarity between two messages is relevant. When d is compared with a subsequent message f, the individual influence is calculated as the product of the time factor, the similarity of the messages and the relevance of this similarity. This results in the final definition of the "influence" of a single message:

E(d) = \sum_{n} z(d, f_n) \cdot sim(d, f_n) \cdot importance(d, f_n)

where sim(d, f_n) is the cosine similarity between d and f_n from Section 2.4.3.

In summary, the Influence of a message depends on three factors: the similarity to the subsequent messages sent by the receiver, the relevance of this similarity, and the difference in time between receipt and response.
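The interplay of the three factors can be illustrated with a small Python sketch. It is a hedged illustration, not the formula used in Condor: the exact definition of importance(d, f) is not reproduced in this copy of the text, so the sketch assumes it to be the IDF-weighted share of the terms common to d and f that the recipient had not used before receiving d. The reference IDF table stands in for IDF_Twitter, and DEFAULT_IDF, weights, importance and influence are hypothetical names.

```python
import math

R_DAYS = 4.0       # observation period r from the text
DEFAULT_IDF = 1.0  # assumed fallback for terms missing from the reference table


def weights(tokens, idf):
    """Term weights: term frequency times a fixed, context-independent IDF."""
    vector = {}
    for term in tokens:
        vector[term] = vector.get(term, 0.0) + idf.get(term, DEFAULT_IDF)
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if one is all zero)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def importance(d_tokens, f_tokens, prior_tokens, idf):
    """Assumed relevance factor in [0, 1].

    Share (by IDF weight) of the terms common to d and f that the recipient
    had NOT used before receiving d. It is 0 when every shared term was
    already part of the recipient's vocabulary.
    """
    shared = set(d_tokens) & set(f_tokens)
    total = sum(idf.get(t, DEFAULT_IDF) for t in shared)
    if total == 0:
        return 0.0
    new = sum(idf.get(t, DEFAULT_IDF) for t in shared - set(prior_tokens))
    return new / total


def influence(d_tokens, followups, prior_tokens, idf, r=R_DAYS):
    """Sum over follow-ups of time factor * similarity * importance.

    followups: iterable of (delay_days, token_list) for the recipient's
    messages after d; prior_tokens: words the recipient used before d.
    """
    v_d = weights(d_tokens, idf)
    total = 0.0
    for delay, f_tokens in followups:
        z = max(0.0, 1.0 - delay / r)  # assumed linear time factor
        total += (z * cosine(v_d, weights(f_tokens, idf))
                  * importance(d_tokens, f_tokens, prior_tokens, idf))
    return total
```

With a reference table in which "problem" is rare, a reply "we have a problem" one day after "i have a problem" yields a high value when the recipient had never used these words, and zero when all shared words were already in the recipient's earlier messages.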
The combination of these three criteria makes it possible, regardless of the type of network, to determine the extent to which a message has caused changes in the behavior of the recipient.

2.4.5. Example

The following simplified example illustrates the procedure. The network consists of four team members, called A, B, C and D, as well as an expert E. Since everyone communicates with everyone, the centrality values of all nodes are exactly the same and therefore offer no conclusion as to who in this network is more important. Using text analysis, a new value, "Influence", is now created to ascertain who is more important at which point in a message exchange over time. For this example, assume that the words "problem" and "solution" are generally rather rare words. The time between messages is identical in each case, and there are no other messages in the network. In the visualization, the more influential people are marked with a correspondingly darker blue color.

1 Person A asks the project members for help. The text that A sends to the project members B, C, and D is: "I have a problem."
2 Person B decides to ask the expert E for help, with the e-mail "We have a problem."
3 The message from A to B has thus influenced B, as the text "a problem" exists in both emails. Since B had not been active in the network before, these are new terms for B, and even if both occur frequently in the network, "problem" has a high complexity. Person A has clearly influenced the network at this point in time.
4 Expert E responds to B with the mail "I have a solution to the problem."
5 The message from B to E has thus gained influence, since E responded and mentioned the word "problem". B has now become more important in the network.
6 B then writes a message to A, C and D. The content is "E has a solution".
7 The message from E to B is considerably influential, since B has just sent three persons a message with the rare word "solution", which B learned from the last message from E. Person E has thus become the most important person, because his message had the greatest influence on communication at this point in time.
8 Over the next few days, all project members use the term "solution" more frequently in their internal communication.
9 As A, C and D received the term "solution" from B, B is now credited as the most influential person in the network.
10 In the short run, each use of the word "solution" in the next few days increases the influence of other messages that have used the term before. In the long run, however, as days turn into weeks, there is less influence, both because of the longer time factor and because everyone has by then already used this term.

2.4.6. Tracing Innovation through text over time

The system described so far operates on the text exchanged within a network of people and identifies influential words, or keywords, and who uses them over time. In the example of Section 2.4.5, this would be the word "solution" and, to a lesser extent, the word "problem". These two words generated the greatest influence and are thus interesting for further analysis. To find the really important words, it is necessary to determine the individual influence I of a single term t. For this purpose, the first step is to create a list of all influential words occurring in all messages. The individual importance values are then composed of the summed products of the IDF values, where N is the total number of messages and K is the number of messages sent by the receiver within the time window of 4 days after the receipt of a message.

Using a visualization of the words with the highest importance value, trends can be observed.
As might be expected, in the network of the example in Section 2.4.5 the words "solution" and "problem" emerge as the most important. It makes sense to use these words for further analysis, for example to discover who the typical early adopters are, according to the definition of Everett Rogers [20], or to follow how the importance of individual words develops over time.

3. Implementation and Evaluation

The system described theoretically in Chapter 2 has been implemented in the Condor 3 software. Users can test the system on any communication network that can be analyzed by Condor 3. This chapter describes the basic Condor operations and how the "Influence" metric has been integrated into the software, along with the results of four practical test applications. These consist of two subnetworks of Twitter and two e-mail networks. The four cases provide a realistic description and evaluation of how the "Influence" measure can be used productively.

3.1. Implementation in Condor

While some of the functionality described in this section was already implemented in Condor 3, other parts had to be developed from scratch. This section describes the fundamental capabilities of Condor and their application. In addition, all new additions to the analysis of text are described in detail, together with their integration into the software.

3.1.1. Functionality of Condor

Condor 3 is a software tool for the analysis of social networks. It includes several "Fetchers", which are used to collect and store data from a variety of Internet sources, as well as a "Content Processor", which can prepare and enrich the data for analysis and visualization. The current Condor version 3.0.a8 has Fetchers for email, Wikipedia, Twitter, Facebook and the Web. The collected data is stored in a common format in a MySQL database. The saved data identifies the actors in the network and their connections, along with selected node and edge attributes.
The actors appear in each case as nodes of a graph, and the links between them are the edges. With the help of content processing, additional attributes can be computed and used to supplement the originally collected data. For example, a Natural Language Processing algorithm can be used to analyze the text of a message, calculate its sentiment and store this value as an attribute of the corresponding message's edge. Condor 3 offers a choice of three views for visualization: a static view, which gives an overview of the entire network; a dynamic view, which also shows the network but takes the time dimension into account; and a word cloud view of the text content used within the network. [21]

Figure 7 Screenshot of the user interface of Condor 3

Figure 7 shows a typical screenshot of the Condor 3 user interface. On the left side appears the name of an open dataset, signified by a green check mark; in the middle, there is a static view of a network with a set of graphic controls above it to manipulate the graph; and on the right side, there is a panel of options to change the appearance of the nodes and edges. The Fetchers and analysis steps are accessible via the menu bar. The menu consists of seven items (File, Fetch, Process dataset, View, Analyze, Export and Help) as well as numerous submenus for processing.

A typical use of Condor is to first create a new database and then create a new dataset and collect data using a selected Fetcher, or to merge two or more existing datasets. As soon as the data is loaded, it can be viewed, processed and filtered, and new data values can be computed. The View menu is used to display and analyze one or more datasets. Depending upon the user's needs, further processing steps can be performed, or key network metrics, including the dataset itself, can be exported using the Export menu.

3.1.2. Tools for Analysis of the network

The menu item "Process dataset" contains the options for editing the selected dataset. A process wizard assists the user in completing the series of forms needed for a desired task or menu selection. The following "Process dataset" submenu items can remove or merge actors and edges of the dataset according to various filtering criteria:

• Graph Pruning
• Node Merging
• Sample
• Remove Specific Actors
• Actor filtering by properties

The three remaining menu items retain the existing actors and annotate them with new values calculated using methods of textual analysis. These three text calculations require readable text, which is stored in a dataset's edges as the messages between two or more actors.

• Language annotation
• Calculate sentiment
• Calculate influence

In this work, the latter two options are the most relevant. The module for sentiment calculation was considerably expanded and separated from the language recognition or annotation module, and was completely rewritten to incorporate the calculation of the new Influence measure.

Language Recognition and Annotation

The step "Save annotation" can be applied to any node or edge property that is stored as text in order to detect the language in use. On Twitter, for example, this could be the self-description of a person or, for each Tweet, the language of the actual message. The identified language is stored as a new field named language[<field>]. For example, if the content of the tweets is selected for language annotation, a new field with the name language[content] is inserted, which stores the short form of the recognized language according to the ISO 639-1 standard [22]. For language recognition and annotation, the Java library "Language Detection" by Shuyo Nakatani is used, which can recognize 53 different languages.
[23] However, messages or tweets on Twitter, due to their brevity, have a slightly lower recognition rate than longer texts, and thus do not always reach the 99 % detection rate which, according to Nakatani, is theoretically possible.

Sentiment Calculation

Calculating the sentiment assumes that the language of the text to be analyzed is known. The language recognition step is therefore carried out automatically whenever the data is not yet annotated accordingly. Note: to refine the results, text that is not in the desired language may be ignored.

Figure 8 The Sentiment mapper default settings automatically use the most likely fields for analysis.

Figure 8 shows the wizard for the calculation of sentiment with the default set of options selected:

Language of the dataset fields: In addition to automatic language detection, users have the option to select German, English, Spanish or French. If a language is selected, the wizard ignores any text that does not correspond to the selected language.

Edge / Node Property to analyze: Here the field to be used for text analysis can be chosen. If a field is present that contains the word "content", it is selected automatically.

Activate chunking: If this box is checked, the OpenNLP module is used to "chunk", or join, two or more words of the text together. For example, a first name and a last name occurring together would be chunked into a single name. This only has an effect on the visualization in the Word Cloud View.

Use OpenNLP: If this box is checked, the separation of the individual words and sentences in a text is performed using the OpenNLP modules. Alternatively, the sentences can be split using simpler regex operations.

Add Complexity / Emotionality: If this option is checked, the two values complexity and emotionality are added in this step.
Execution of Sentiment Calculation

First, all existing text is quickly preprocessed. This step removes all HTML code in emails, selects the most recent message and filters out blank messages. In the next step, the wizard checks whether language information is already available and adds it if needed. This is followed by the calculation of sentiment, as well as of complexity and emotionality if the user selected this option. The exact calculation of sentiment in Condor is described in the documentation of another project [24] and is therefore not discussed here.

For performance reasons, multi-threading is used, with each thread analyzing a portion of the data. A further crucial step, which also increases performance significantly, especially in e-mail networks, applies when multiple edges have the same text content, such as when an e-mail has multiple recipients. In this case, it is useful to analyze the message once and then copy the results to the other edges.

After the wizard has successfully completed its processing and analysis for each edge and each node, the following nine new data fields are appended to the original source data:

• Language
• Sentiment
• Emotionality
• Complexity
• Keywords
• Sentiment Words

Note: If only edges were analyzed, then all nodes will get these fields appended:

• Average Sentiment
• Average Emotionality
• Average Complexity

The Sentiment Words field is used only for the visualization as a word cloud in the Word Cloud View and includes the words used and the sentiment of the context in which they were used. The prior version of Condor included only this field, along with language and sentiment; the other fields were added in the course of this work. The Keywords field is especially important, because it contains the most used words and how often they appeared in the analyzed text.
This makes it possible to calculate the Influence measure without having to use the raw text data again.

Influence Calculation

This module is used to calculate the Influence measure according to the description in Section 2.4. As with the sentiment analysis, a process wizard walks the user through the steps for the analysis of the text. The calculation of the Influence assumes that, for each edge to be processed, the language annotation already exists, as well as a list of words and their frequencies.

Figure 9 The Influence calculator menu

The options of the Influence wizard are shown in Figure 9 and are partly identical to those of the sentiment calculation. The choice of language again allows the analysis to be limited to a designated language, and it is also necessary to specify the field to be analyzed. In this step, two different types of text edge fields are allowed: 1) a field with readable content; or 2) pre-processed text in the form of a map of words and their frequencies in the corresponding text. The latter is preferable, since it is then not necessary to call the corresponding step of the sentiment calculation. If a field named "keywords[*]" exists, it is selected automatically; otherwise a field with the name "content" is selected, if one exists.

Create new dataset: if this option is checked, a new dataset is created that contains only the edges in which one actor had an impact on another actor. This can be used to better identify who is influencing whom. The name of the new dataset can be specified manually; the proposed default is the name of the existing dataset with the string "_Influence" appended as a suffix.

The calculation of the Influence adds five new fields to an actor's record, which arise from the analysis of the textual data. The following fields are created:

• Total Influence
• Average Influence
• Messages Sent
• Most common words
• Most influential words

The value "Messages Sent" indicates the number of sent messages.
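How the numeric actor fields could be derived from per-edge influence values is sketched below in Python. This is an assumption-laden illustration rather than Condor's code: the edge tuple layout and the function name actor_fields are hypothetical, a message sent to several recipients is assumed to share one message identifier across its edges so that it is counted once, and Average Influence is assumed to be the mean over the actor's outgoing edges, since the text does not state the denominator.

```python
from collections import defaultdict


def actor_fields(edges):
    """Aggregate per-edge influence values into per-actor fields.

    edges: iterable of (sender, recipient, message_id, influence) tuples.
    Edges belonging to the same multi-recipient message share a message_id.
    """
    total = defaultdict(float)
    edge_count = defaultdict(int)
    message_ids = defaultdict(set)
    for sender, _recipient, message_id, influence in edges:
        total[sender] += influence
        edge_count[sender] += 1
        message_ids[sender].add(message_id)
    return {
        sender: {
            "Total Influence": total[sender],
            # Assumption: averaged over outgoing edges, not over messages.
            "Average Influence": total[sender] / edge_count[sender],
            # A message with several recipients counts as one sent message.
            "Messages Sent": len(message_ids[sender]),
        }
        for sender in total
    }
```

Applied to the first steps of the example in Section 2.4.5, A's single message to B, C and D produces three outgoing edges but a "Messages Sent" value of 1.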
Note: messages with multiple recipients are counted as a single message. The Influence of a single message d is calculated exactly according to the definition in Section 2.4.4. The values of Total Influence and Average Influence are then calculated from the values of the outgoing edges of the actor.

In addition to the numeric values, two new textual values are added, which illustrate the actor's use of language by listing the 5 most common and the 5 most influential words.

3.1.3. Visualization of a Network

Every Condor 3 network has the same fundamental format, but the properties of the individual nodes and edges may differ depending on the network type. A node can be the e-mail address of a person, a Facebook account, a web page or a Wikipedia article, and a network may even combine all of these node types. An edge represents a connection between two nodes, which may be an e-mail message or a single link between two web sites or Wikipedia articles.

This thesis deals primarily with human communication networks. It can therefore be assumed that a node is a real person, and the edges are interpreted as messages between people. Thus, the edges always carry text in the form of a tweet, a direct message or something similar.

Figure 10 The Fruchterman-Reingold algorithm provides a useful visualization of the graph.

For the visualization of the network, there are various options regarding the visual representation. Condor lays out a dataset with the Fruchterman-Reingold [25] algorithm, a force-directed method in which nodes repel each other and edges act like springs. As can be seen in Figure 10, this also allows, in certain cases, the detection of clusters or teams.

Condor enables a user to change the size, color or shape of a node depending on a predetermined characteristic or variable of the actors. Thus, the actors can be distinguished in the network map according to their influence, and important actors can be found visually much faster. In addition, nodes can be colored according to different criteria, as shown in Figure 11. The node size in Figure 11 depends on the number of messages sent to the respective person.

Figure 11 The same network as in the previous figure benefits from different options for representation.

It is also possible to display node or edge labels, which, for example, give the name or other essential characteristics. Note: for datasets with a large number of nodes, this display option can make a graph unreadable, which would be the case with Figure 7.

3.1.4. Detailed nodes and edges information

The calculation of sentiment and influence enables new options in the visualization of the network. If both analyses are run, 6 new edge fields and 8 new node fields are created, which can be viewed in the static view.

The 6 new edge fields are:
1. Language
2. Sentiment
3. Emotionality
4. Complexity
5. Keywords
6. Sentiment Words

The 8 new node fields are:
1. Total Influence
2. Average Influence
3. Messages Sent
4. Most common words
5. Most influential words
6. Average Sentiment
7. Average Emotionality
8. Average Complexity

The new edge and node fields can be displayed by right-clicking on a node or edge and selecting either "Node Details" or "Edge details". Examples can be seen in Figure 12 for Node Details and Figure 13 for Edge Details.

Figure 12 Details of a single node after analysis

Figure 13 Details of a single edge after analysis

More important is the analysis in the static view, either by setting labels at the nodes or, even better, by making the node size dependent on one of the new numerical values. This makes it possible to recognize the influential actors of the network directly, as the example visualization in Figure 14 shows. The results for this network are considered in detail in Chapter 3.3.
Figure 14 The influential persons are readily apparent in this network.

3.1.5. Diagrams

Under the View menu, there are options to create a word cloud as well as several diagrams to visualize the data. In addition to the already existing graph of the contribution index, five more measures based on the text analysis have been added: Activity, Sentiment, Emotion, Complexity and Influence.

One of the new graphs combines the representation of four values: Activity, Sentiment, Emotion, and Complexity. These values must have been calculated beforehand from the analysis of the text contents in the network. If no values have been calculated, an appropriate error message appears, as seen in Figure 15.

Figure 15 Error message that appears when no text analysis has taken place.

To create the new text diagram, select "Sentiment over time" under the View menu, which by default displays the values of Activity and Sentiment. Users have the option to add Emotionality and Complexity to the diagram; if these are selected, the axes are automatically adjusted for the selected data. Figure 16 represents an e-mail network over a time frame of 10 days, which shows large variations in activity during the week and a relatively stable Sentiment.

Alternatively, such a diagram can also be opened through the context menu of a single node, which makes it possible to observe the properties of the selected actor. In this case, the value "Influence" can be visualized in addition, provided "Calculate Influence" was run previously.

Figure 16 Sentiment and activity represented as a graph.

Depending on the period covered by the dataset, the chart automatically uses the appropriate time unit according to the scheme in Table 2.

Minimum        Maximum        Selected unit
0 seconds      30 minutes     Seconds
30 minutes     1 day          Minutes
1 day          10 days        Hours
10 days        Unlimited      Days

Table 2 Time units depend on the period covered.
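The unit selection of Table 2, together with the smoothing rule that the text gives for these diagrams (a radius q of the number of available time units divided by 50, rounded down, but at least 1), can be sketched in Python as follows. This is an illustrative assumption, not Condor's implementation: the function names are hypothetical, the lower bound of each row in Table 2 is treated as inclusive, and the averaging window is simply truncated at the ends of the series.

```python
from datetime import timedelta


def select_unit(period):
    """Pick the chart's time unit from the period covered (Table 2).

    Assumption: each row's minimum is inclusive and its maximum exclusive.
    """
    if period < timedelta(minutes=30):
        return "seconds"
    if period < timedelta(days=1):
        return "minutes"
    if period < timedelta(days=10):
        return "hours"
    return "days"


def smoothing_radius(n_units):
    """q = number of available time units divided by 50, rounded down, at least 1."""
    return max(1, n_units // 50)


def smooth(values):
    """Moving average over q units before and after each point.

    Assumption: near the start and end the window is truncated rather than padded.
    """
    q = smoothing_radius(len(values))
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - q): i + q + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

For a period of 200 days, smoothing_radius(200) gives q = 4, so each value is averaged over 2 · 4 + 1 = 9 days.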
For a better representation, the data are smoothed over plus and minus q time units. q is the number of available time units divided by 50 (rounded down), but at least 1. For example, for a period of 200 days, Activity is presented as the average number of messages per day, smoothed over 9 days, that is, 4 days before and 4 days after.

The diagram's data can also be exported as a CSV file and explored further in other tools, such as Excel. The exported data are also smoothed, so that they do not deviate from the representation. However, an export of the unsmoothed original data is planned for a later release of Condor and will be accessible via the Export menu.

Word Usage

Figure 17 is an example of a Word Usage Graph for a Twitter network, in which the four most important words (SVP, Switzerland, initiative and @1zu12) are examined in more detail. The graph shows the frequency with which the words are used, specified as the average number of messages per day containing the corresponding word. The table also shows the most important facts about each word, notably the following:

• Number of mentions of the word
• Number of persons who used the word
• The person who used the word for the first time
• The person who had the highest Influence on the spread of this word
• "Early adopters", or the first 15% of people who used the word [20]

Figure 17 Overview of the most important words of the network and their dissemination

The diagram is particularly useful when data for a network are available over a long period of time. Such datasets enable users to observe who the influential people are on which topics or issues over time. The data are smoothed to provide a sensible representation.

3.2. Test Data: Overview and Treatment

This chapter provides an overview of the test data used to analyze the results of the calculation of the Influence measure and describes how the data were created. Among other things, the test data should show whether computing the Influence makes it possible to find "important" people in a network.

The communication medium Twitter is very well suited for the collection of test data, as it can be used for various purposes and is very accessible. In this work, the question arises: What topics benefit most from the visualization of the influence of individual persons? A realistic application area could be marketing, to find important people who could convince others to use a particular product. Another potential application could be in journalism, for reporters to find a relevant informant or news source in order to get crucial information on a story. In both cases, a dataset could be created with a small private Fetcher for people on Twitter, in which the connections between the actors are direct communication by means of an @username mention as well as retweets, where present. The network of Twitter "followers" is less suitable for collection and analysis, because it can quickly become unreasonably large and because such a relationship often does not mean that a person actually reads the tweets of another person.

Another example could be data from e-mail networks. Data collected from project teams have the potential to be quite useful, especially if additional information is known, such as the performance of a team or the hierarchical structure of departments. Another data source could be personal e-mail networks, to show an individual user who is especially important in his or her personal environment.

3.2.1. Twitter: Politics

In a prior work [8] [26] by the author, a way was described to analyze the behavior of Swiss politicians in Parliament.
This network includes the Twitter accounts of 71 members of Parliament; accounts without any direct communication with other politicians are not visible in the network. Both the National Council and the Council of States are taken into account, as is the Federal Council in the case of Alain Berset. For the analysis, up to 200 tweets per person from the period 28 November 2013 to 28 January 2014 were used.

Overview:
• Total number of actors: 1,433
• Investigated persons: 71
• Number of tweets: 9,936

3.2.2. Twitter: BMW

Various methods can be employed to measure the success of a brand. Very often, Sentiment analysis is used to gain an overview of the opinion of the crowd [27], but it is often also useful to look for the people who successfully get others talking about the brand. The question is whether the Influence value can be helpful in this typical cool hunting task.

As an example of such an application, tweets about BMW collected over a short period are analyzed. The data set was created with the normal Twitter Fetcher, which collected tweets containing the search term BMW for 8 hours. In addition, among the 2,887 persons found, the 50 Twitter accounts whose recent tweets contained the word "BMW" most frequently were determined. These were in particular the official accounts of BMW, such as BMW_Deutschland or BMWGroup. The last 20 tweets of each of these accounts were collected to further expand and strengthen the network. The case serves as an example of an analysis over a short time span to gain insight into the Twitter discussion around a brand.

Overview:
• Total number of actors: 2,887
• Number of tweets: 5,964

3.2.3. E-mail: COIN Seminar

In the fall semester of 2013, a course was held at several universities on the theme "Collaborative Innovation Networks", in short COINs2013.
This involved students from five universities (MIT, SCAD, Aalto University, University of Cologne and University of Bamberg), who took the course at the same time, each in their local language. Project teams spanning universities and time zones were formed and worked together for the semester. This seminar has been conducted for several years and offers an ideal opportunity to examine e-mail networks, which have already been collected several times across numerous projects. [28]

A special feature of this course is that the students, including the author, use the Condor software to analyze the e-mail communication within their project teams. All messages are cc'd to a dummy e-mail address throughout the course. Thus, there is a separate Gmail account in which all, or almost all, of the messages sent among the student teams, instructors and teaching assistants are stored, which results in an interesting communication network for each project team. The students involved have all agreed to the use of their e-mail messages for research and analysis purposes. The data are anonymized in this report and were kept confidential.

Overview:
• Participating students: 52, divided into 11 teams
• Total number of nodes: 130
• Total Posts: 1,563
• Total number of edges: 7,893

3.2.4. E-mail: Private E-mail accounts

A common use of Condor is the analysis of a personal mailbox, even if only to learn how to use the software. The advantage of such an analysis is that a lot of information about the network and the actors is already known to the mailbox owner, so the results can easily be checked for face validity. The system was tested with multiple private mailboxes of different people in order to determine whether the set of most important people found in the network coincides with the subjective opinion of the account holder. In this report, two networks are listed as examples, which are analyzed anonymously.
Overview:
• Available Actors: 2,215
• Total Posts: 1,311
• Total number of edges: 11,187

3.3. Evaluations

The networks listed in Section 3.2 were prepared and examined with Condor. The module "Calculate influence" was used as described in Section 3.1.2 to find the most influential people in each network. Depending on the network, suitable dependent variables are available to justify the relevance of the Influence value.

An important question that must be answered in the evaluation of the data is the following: Is the "Influence" measure more appropriate than the betweenness centrality measure in its ability to identify the "importance" of a person?

The analysis shows that the definition of Influence is very well suited in all four applications to obtain information about the relevant actors of the considered network.

3.3.1. Twitter: Politics

In Switzerland, approximately one-third of the Parliament has a Twitter account. While some members are very interactive and involve many other people in the conversation, others are much less involved in the communication network. Figure 18 shows the coarse structure of the network: all parliamentarians are colored according to their party, and other people are represented in the network as points. The structure of the network is not very informative; only the rough distribution of the parties stands out.

Figure 18 Parliamentarians in this network are colored according to their party affiliation; other Twitter accounts are drawn as points.

The crucial question is which of the politicians are the "most important" in the network. To answer this question, the politicians are compared across four measures: influence, betweenness centrality, number of followers and activity / interactivity.
The last of these comes from the researchers of the blog Some Polis, who have explored social media and politics and used "activity / interactivity" as an important measure. [29] This is just one of many scales used to assess the importance of people on Twitter, alongside a person's proprietary Klout score [30] or a combination of retweet and mention counts. [31]

Figures 19, 20 and 21 each show a section of the network with the most important politicians according to three scales: number of followers, Betweenness Centrality and Influence. The size of the nodes depends on the chosen scale, and the top 20 actors are labeled by name.

Figure 19 Number of followers as a measure of the importance of politicians

Alain Berset has the most followers, as he is part of the Swiss Federal Council, but he is relatively inactive. In contrast, Cédric Wermuth and Natalie Rickli are much more active and seem to have obtained a sizeable number of followers. Christian Levrat and Pascale Bruderer, with the fourth and fifth most followers, differ widely in their tweeting activity: 475 and 97 tweets respectively. This suggests that, at least in this sample, the number of followers depends mainly on external factors, i.e., political prominence, and not on the actual activity on Twitter. That Influence is not necessarily associated with the number of "followers" has also been reported in previous work. [32]

Figure 20 Betweenness Centrality as a measure of the importance of politicians

Figure 21 Influence as a measure of the importance of politicians

The Betweenness Centrality shows a different picture and favors those who interact with many other people on the network, through either retweets or mentions of user names. The list is topped by Aline Trede, who is connected to a total of 764 people in the network (degree centrality), which is remarkable given 851 tweets.
However, this is rather an outlier; other people with high betweenness centrality tend to have relatively high numbers of tweets.

The third illustration shows the important people as measured by their Influence score on the network. At first glance, this picture shows some similarities with the visualization of the people with a high Betweenness Centrality. It appears that people with high betweenness scores also have a high Influence score. However, that is not always the case. Bernhard Guhl of the BDP ranked 3rd on Influence but only 23rd on Betweenness.

The data can also be compared in tabular form, as summarized in Table 3, in which 20 persons are ranked by their Influence score. In total, an Influence was measurable for 57 people. The full table can be found in Appendix A.

Table 3 List of the 20 MPs with the highest Influence on the network, out of a total of 57 people.

Analysis

Is this list an accurate ranking of the most influential people on the Twitter network? And, if so, why? What other evidence would serve to corroborate these findings? Even though these questions are very subjective, a few examples provide support for the accuracy of the Influence scale.

Christian Wasserfallen (Vice President, FDP) is ranked 1st on Influence and 3rd on Betweenness. He has long been active on Twitter and is retweeted more often than any other parliamentarian. This has a very positive effect on both his Influence score and his Betweenness Centrality. In comparison, Balthasar Glättli (Green Party), who ranked 10th on Influence and 6th on Betweenness, has almost three times as many tweets, but these tweets are rarely retweeted.
Kathy Riklin (CVP), who is ranked 2nd on Influence and 14th on Betweenness, is a little less active. However, she writes about a very broad set of topics, often communicates with other parliamentarians, and is followed on Twitter by more than 40 of them, which sets her apart as a front runner.

Bernhard Guhl (President of the BDP in Aargau), who ranks 3rd on Influence, is low on Betweenness with a rank of 23rd. He was retweeted less often, but by this calculation he nevertheless has a fair amount of Influence on the network. This is because he is credited with spreading the hashtag #stk2014, which stands for the Electricity Congress 2014. Tweets with this hashtag were often retweeted.

The table also shows that 9 of the top 10 people on the Influence metric are also among the top 20 on the Betweenness Centrality list. The only exception is Bernhard Guhl, as described above.

Another notable outlier is Mathias Reynard (SP), who ranked 47th on Influence and 9th on Betweenness. This can be explained by the fact that his tweets are entirely in French, and most German-speaking parliamentarians less often retweet foreign-language tweets or communicate across the language border.

Hardly any correlation can be observed between the number of followers and the other two values, Influence and Betweenness Centrality. This is also reflected in the fact that some politicians, like Jacques Neirynck (CVP), already have a large number of followers without ever having written on Twitter.

3.3.2. Twitter: BMW

The topology of the Twitter: BMW network immediately reveals two subnetworks in which people tweet to each other, while most people have no direct connections to one another. The two subnetworks are shown in Figure 22. Tweets in the green colored area are written in Spanish, and the network is centered around @BMWEspana.
The larger network in the blue shaded area includes only English tweets and contains the account @BMW. The entire network also includes further official BMW accounts, as well as bloggers, reporters and licensed BMW dealers.

Figure 22 The Twitter: BMW network shows two subnetworks: the smaller green Spanish network and the larger blue English network.

People in such a sparse network have correspondingly poor chances of gaining Influence. Nevertheless, a calculation of the Influence does show a clear winner: @Ocean_BMW. Figure 23 shows a small part of the network in which this account is visible. The account owner is an officially licensed BMW dealer in Plymouth (UK) called Ocean BMW.

Figure 23 The node size depends on the Influence of the people. In this area of the network, Ocean_BMW appears to be very influential.

The analysis shows that Ocean_BMW was able to bring the terms #plymouth and showroom in particular into the network. This is no surprise: on the collection date, 5 February 2014, Ocean BMW had just inaugurated a new showroom in Plymouth (UK) and presented new BMW models. After Ocean_BMW initially wrote about it (see footnote 6), more tweets by others on this topic followed. This is also apparent in the word cloud view of the discussion. Figure 24 reflects a total of 152 mentions of @Ocean_BMW (shown in the figure only as @ocean), in which the comments were usually about coming to see the new showroom.

Figure 24 The word cloud view scales the words with their frequency of occurrence in the data set. Green words are positive, red words are negative.

In contrast to the analysis of the politicians, the Twitter: BMW test data were confined to a narrow time window. The value of the Influence measure therefore lies precisely in identifying the persons who were most important during the observed period.
This is a big advantage over other methods that take into account the total number of followers, retweets, etc. The example thus shows that even in networks that can be created in a relatively short time with simple means, influential people can be identified with a few clicks.

Footnote 6: Ocean_BMW mentioned the showroom in 6 tweets. The most successful of these can be reached at: https://twitter.com/Ocean_BMW/status/430997170947252224

3.3.3. E-mail: COIN Seminar

A simple structural analysis of the COIN course network can answer some fundamental questions about interaction without using any text analysis. Specifically, to what extent did students engage in cross-team collaboration? A simple visualization using the Fruchterman-Reingold layout algorithm provides a picture of the degree to which team members were networked together.

But if only the connections themselves are considered, important information remains hidden. In many teams, members e-mail everything to everybody, and consequently their centrality values are very low or identical. Figure 25 shows an example of a project team with 6 members, with the size of each node scaled to reflect its Betweenness Centrality. The structure of the project team does not reveal very much beyond the fact that each team member has communicated at least once with every other member and that one member has a single connection to a person outside the project team.

Figure 25 A calculation of the Betweenness Centrality provides little new information about the network.

In contrast, a much different picture emerges when the messages themselves are analyzed. Scaling the node size as a function of the Influence of the nodes leads to Figure 26. Here it is clearly visible that the people in the network were not equivalent, as the Betweenness Centrality values would suggest.
Figure 26 The Influence value shows that there are clear differences between the people of the project team.

The question now is whether the person in the middle did in fact exercise the greatest influence on the project team, and, if so, whether this also applies to other project teams. To investigate this, a survey of the project teams was conducted, in which each participant in the COINs course was asked the following question:

Who in your team had the greatest influence on the result of your project?

It was pointed out that the information would be kept confidential and that respondents could optionally nominate themselves. The survey was conducted with the participants of the current course of 2013 towards the end of the course, and also with the participants of the previous year's course, some 10 months after its completion. Table 4 shows a summary of the number of responses received. As might be expected, response rates from participants of the previous year's course were relatively low.

Course    Participants    Project Teams    Replies    Teams without answers
2012      52              10               12         4
2013      51              11               33         1

Table 4 Summary of responses to the survey regarding influence in the project team

In total, data on 16 project teams were obtained from 45 out of 84 people. Because the question is highly subjective, the answers within most teams are not unanimous. The data can be used as a comparison to the calculated value of the Influence measure, but it must be noted that some uncertainties exist. Nevertheless, evaluating the participants' responses against the calculated Influence scores for each project team does serve as a valid quality check.
The following Table 5 describes the survey results for project team 7 from the course of 2013 and lists the number of votes per person and the calculated Influence score according to the analyzed e-mail network:

                      Person A    Person B    Person C    Person D    Person E
Votes                 -           2           -           -           1
Influence             13.41       22.21       7.89        4.77        19.23
Number of Messages    143         188         100         54          198

Table 5 Example of the obtained data of a project team

Two people voted for Person B as the most influential person, and one person thought Person E was the most influential. The calculation of Influence in the communication network reflects this situation accurately, because Person B has the highest Influence value, followed by Person E.

This analysis was repeated for the other 15 project teams, with the following results:

• In 10 of the 16 teams, the person who received the highest Influence score also had the most votes.
• In three other teams, the person with the highest Influence score received at least one vote.
• Only three teams showed no positive correlation between the number of votes and the Influence score.

The full results are in Appendix B.

Of the three teams in the last category with no correlation, one project team did not send all of its communication to the dummy Gmail address. In the other two cases, the project members selected a person who was only moderately influential in the analysis of the e-mail communication. Perhaps in these two project teams, people contributed significantly to the result of their team but did not use e-mail as their dominant mode of communication.

It should be noted that the teams from 2012 had better matches between their Influence scores and the number of first-place votes. This could be related to the higher number of e-mails for these project teams. In contrast, the 2013 course participants communicated more via Skype and other chat programs.
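The comparison between survey votes and Influence scores described above can be written down as a small classification rule. The following is a minimal sketch using the Table 5 values for project team 7; the function name and the three category labels are hypothetical, and the handling of ties (a tie for the most votes still counts as a match) is an assumption that the text does not specify.

```python
def classify_team(votes: dict[str, int], influence: dict[str, float]) -> str:
    """Compare the survey votes of one team with its Influence scores.

    Returns one of three hypothetical labels mirroring the three bullet
    categories in the text:
      "match"   - the person with the highest Influence also has the most votes
      "partial" - that person received at least one vote, but not the most
      "none"    - that person received no vote at all
    """
    top_influence = max(influence, key=influence.get)
    most_votes = max(votes.values(), default=0)
    received = votes.get(top_influence, 0)
    if most_votes > 0 and received == most_votes:
        return "match"
    if received > 0:
        return "partial"
    return "none"


# Values from Table 5 (project team 7, course of 2013).
team7_votes = {"Person B": 2, "Person E": 1}
team7_influence = {
    "Person A": 13.41,
    "Person B": 22.21,
    "Person C": 7.89,
    "Person D": 4.77,
    "Person E": 19.23,
}

if __name__ == "__main__":
    print(classify_team(team7_votes, team7_influence))  # match
```

Applied to all 16 teams, counting the three labels would reproduce the kind of summary given in the bullet list above, provided the votes and scores from Appendix B are supplied in the same form.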
The example of these 16 project teams shows that the calculated value of Influence often correlates very well with the subjective opinion of the people in the communication network. At the same time, the analysis of the e-mail communication network is not sufficient in every case to identify the most influential people, and the result may depend on circumstances outside of the collected data. The Influence measure should therefore not be used, for example, to make statements about the quality of employees or about the contributions of individual project members. Influence may, however, be a useful measure for the analysis of communication. The Influence measure takes into account much more information than, for example, the Betweenness Centrality, and it very often found the most important actors in a network.

3.3.4. E-mail: Private E-mail accounts

When reviewing the topology of a private e-mail network, it is usually difficult to get an overview. Often the result is a more or less star-shaped network in which everyone appears linked to everyone else. Figure 27 shows an example of a student's e-mail network.

Figure 27 The Betweenness Centrality in this network has only a very limited explanatory power.

Many of the people with high Betweenness Centrality scores have actually sent only a few messages but used a very large distribution list. The node belonging to the e-mail address "_D_E18_62_I_Studierende@mx.ds.fhnw.ch", for example, has the fifth highest betweenness centrality but is obviously only a distribution list. The Betweenness Centrality value indicates the crucial links between the existing sub-networks, but does not correspond to what the account owner would subjectively classify as most "important."
If the nodes of people who never sent a message were removed, the result would be a star-shaped structure, and the relevance of Betweenness Centrality would decline even more.

A much different picture emerges when the node size is scaled with the Influence measure, as can be seen in Figure 28. The people who are drawn correspondingly large in this figure are mainly people from a narrow circle of colleagues, as well as the student's individual faculty advisor. The measure thus identifies people who were actually relevant to the student's life in this communication network.

Figure 28 Individuals with high Influence are located more centrally in the network.

A very similar situation was observed in another e-mail account, that of an FHNW employee who communicated mainly with two other employees during the observed period. In Figure 29, these employees are colored in yellow and light blue, while the node of the account owner is shown in green.

Figure 29 Node size corresponds to the Betweenness Centrality of the nodes. The person in dark blue very often sends e-mails with large distribution lists and little content and has the second highest Betweenness Centrality score.

A comparison with Figure 30 shows that the dark blue person had virtually no influence on the network. The other two, who actively cooperated with the person in green, are clearly important in this communication network.

Figure 30 Node size corresponds to the Influence of the node.

3.4. Discussion

The previous chapter dealt mainly with classifying the actors of the network according to their Influence. In the process, additional node and edge metrics were computed. This chapter describes patterns in the data that can be found using the statistical technique of analysis of variance.

3.4.1. Properties of influential actors

Are there common properties of influential actors?
Yes, the critical factors can be found quickly: Activity (number of messages) and Betweenness Centrality both correlate significantly with Influence, based on an analysis of variance at the 1% level. This means, for example, that persons who actively communicate with multiple project teams, and thereby pass on ideas and solutions between the project teams and their members, are particularly influential.

Note: At times, even some "spammers" are very active and very central in the network but overall have a lesser impact on their surroundings. This can apply to both Twitter and e-mail networks. Such "spammers" can be characterized as often using the same words. A Twitter account that always writes about the same issues has much less influence than someone who takes up various topics, even if the messages reach only individuals in the same network, because the effect depends on whether the recipients have already used the words often themselves. So, can someone who already writes a lot about a particular topic be affected by a "spammer" who uses the same terms? No. Even being retweeted frequently on Twitter is of little use if the retweets always come from the same people.

3.4.2. Properties of influential messages

Without sufficient meta information, it is practically impossible to predict with a sufficiently high probability whether a message will be influential. Nevertheless, successful messages share certain properties.

Text length is a crucial factor, especially on Twitter. Very short tweets are often not worth retweeting, since they contain hardly any important information. The same applies to e-mails: very short messages often contain little information, for example only the acceptance of an event invitation, or just a reference to something else followed by "best regards".

Sentiment and emotion do not seem to have any influence on the impact of a message. No correlation was observed.
High complexity seems to have a very small positive effect, but this cannot be confirmed with the test data at the 5% level.

For e-mails, a key factor is the number of recipients of a message. The higher the number of recipients, the lower the impact of the individual message. This shows that it is not useful to cc many participants on a communication, because few people pay attention to it. On Twitter, however, there is no harm in mentioning others with an @Mention, because people then feel directly addressed and are more likely to reply.

3.5. Influence Advantages and Other Applications

The calculation of the new Influence metric brings several advantages for network analysis: because the actual text is analyzed instead of only the connections, this measure is much more difficult to manipulate. Purely structural network metrics such as Betweenness Centrality can easily be manipulated by creating a large number of connections to other nodes. In contrast, to get a higher Influence score, you have to actively communicate with others and then have those other people change their behavior and include new terms in their subsequent messages.

Of course, the Influence metric too can be deliberately manipulated. For example, two people could send each other dictionaries, but this would require significantly more effort and "criminal energy".

Although there is still the danger of manipulation, taking the text into account is a good step towards a more balanced analysis of communication behavior. Even Google no longer uses the weighted PageRank exclusively to present search results and benefits from using text analysis. Conventional tools for the analysis of social network graphs, such as Gephi [33], unfortunately do not offer text analysis and thus leave out the content dimension in network analysis.
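The manipulation argument above can be illustrated with a deliberately simplified toy model. The sketch below is not the Influence calculation of this work, which is defined in an earlier chapter; it only mirrors the idea stated in the text that a sender gains Influence when a receiver later starts using terms it had not used before. The function name, the message format, the whitespace tokenization and the rule of crediting the first sender of a word are all assumptions made for illustration.

```python
from collections import defaultdict


def toy_influence(messages):
    """Toy adoption count; `messages` is a chronological list of
    (sender, receiver, text) tuples.

    A sender earns one point each time a receiver later writes a word
    for the first time that this sender had previously sent to them.
    """
    used = defaultdict(set)      # words each person has written so far
    pending = defaultdict(dict)  # person -> {received word: first sender of it}
    score = defaultdict(int)

    for sender, receiver, text in messages:
        words = set(text.lower().split())
        # Words the sender writes for the first time: credit whoever sent them.
        for word in words - used[sender]:
            source = pending[sender].pop(word, None)
            if source is not None:
                score[source] += 1
        used[sender] |= words
        # Words the receiver has never written become adoption candidates.
        for word in words - used[receiver]:
            pending[receiver].setdefault(word, sender)
    return dict(score)


example = [
    ("anna", "ben", "new showroom opening"),
    ("ben", "carl", "visit the showroom"),   # ben adopts "showroom" from anna
    ("dora", "anna", "hello"),               # dora only adds connections
    ("dora", "ben", "hello"),
    ("dora", "carl", "hello"),
]

if __name__ == "__main__":
    print(toy_influence(example))  # {'anna': 1}
```

In this toy example, the hypothetical actor dora has the most outgoing connections but receives no credit, because nobody takes up her words, whereas anna is credited once for the adopted term. A realistic implementation would at least need stop-word removal and keyword extraction, which are omitted here.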
Section 3.2 describes four specific applications of the new Influence metric, but the possibilities for additional use cases are nowhere near exhausted. For example, blogs have not been taken into account, although they have a lot of potential for interesting research. An influential blog would be one that quickly takes up new topics, which readers then carry on as new topics, ideas and innovations to other blogs. This raises the question of whether some blogs frequently simply copy from others.

In e-mail networks, the main problem in using the Influence metric is access to people's e-mail. However, private networks, including for example one's own e-mail, can easily be analyzed; Section 3.2.4 shows that the analysis is likely to find important people in such a network.

4. Conclusion

The inclusion of text analysis allows important insights into the analysis of social networks. The calculation of the influence of a single message and of its direct impact on a receiver is a useful extension and generalization of existing approaches, which often work only for individual, predefined networks. The biggest challenge is addressing the variety of individual network properties that need to be taken into account in order to convert the messages into a common schema for efficient analysis. However, this study demonstrates that these challenges can be overcome and that it is possible to trace the diffusion of new ideas, words and concepts among users over time based on the content of their digital communication.

A disadvantage of the method is that it is not optimized for a particular network or for a specific language. The Influence metric calculation assumes that people have not used the identified keywords in prior communications, but this assumption may not always be true, because sufficient historical data going further back in time are lacking.
However, the four selected test cases have demonstrated that a relatively wide range of possible applications can be covered with meaningful accuracy. Compared with the common structural network measures, the new content-based Influence measure outperformed them in identifying the influential people in a human communication network.

5. Recommendations for Future Work

With this work, the text analysis capabilities of Condor 3 were increased significantly, but there is still considerable room for improvement. Currently, the semantics of the text are hardly examined, and the introduction of an ontology could be useful. It might then be possible to detect influential messages even when the receiver does not use exactly the same words but talks about the same subject with a related vocabulary. It should also be tested whether better standardization of the texts improves the analysis further.

To investigate the value of the Influence measure further, additional networks such as Tumblr or similar platforms could be considered. With enough data it should be possible, for example, to find the persons who are particularly successful at rapidly spreading Internet memes [34]. Such studies can help to understand how information spreads on the Internet and which factors play a role. The findings could then possibly be used to predict political unrest in crisis-prone regions [35].

With the extensions created in this work, Condor 3 offers a large set of analysis tools that can be used for various purposes. It cannot capture everything that happens in a network, but the available information is sufficient for several analyses, which open the door for future research in network analysis and in the field of Cool Hunting.

6. Bibliography

[1] P. Gloor and S. Cooper, Coolhunting: Chasing Down the Next Big Thing, AMACOM, 2007.
[2] L. C. Freeman, "A Set of Measures of Centrality Based on Betweenness," Sociometry, vol. 40, no. 1, pp. 35-41, 1977.
[3] L. Page, "PageRank: Bringing Order to the Web," Stanford Digital Library Project, Stanford, 1997.
[4] P. Gloor, "Condor Core," Galaxy Advisors, 2012. [Online]. Available: https://galaxyadvisors.com/services/condor-core.html. [Accessed on 10 January 2013].
[5] R. Diestel, Graph Theory, Berlin, New York: Springer-Verlag, 2005.
[6] G. Sabidussi, "The centrality index of a graph," Psychometrika, vol. 31, pp. 581-603, 1966.
[7] P. Gloor and Y. Zhao, "TECFLOW - A Temporal Communication Flow Visualizer for Social Networks Analysis," ACM CSCW Workshop on Social Networks, 2004.
[8] L. Brönnimann, "Multilanguage sentiment analysis of Twitter data on the example of Swiss politicians," Windisch, 2013.
[9] F. Lüscher and M. Brun, "Sentiment Analysis," University of Applied Sciences Northwestern Switzerland, Windisch, Switzerland, 2013.
[10] A. Go, R. Bhayani and L. Huang, Twitter Sentiment Classification using Distant Supervision, Stanford, 2009.
[11] Apache OpenNLP Development Community, "Apache OpenNLP Developer Documentation," 2012. [Online]. Available: http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html. [Accessed on 10 January 2013].
[12] E. Bakshy, I. Rosenn, C. Marlow and L. Adamic, "The Role of Social Networks in Information Diffusion," in Proceedings of the 21st International Conference on World Wide Web, ACM, 2012.
[13] X. Wei, J. Yang, L. A. Adamic and M. Rekhi, "Diffusion dynamics of games on online social networks," in Proceedings of the 3rd Conference on Online Social Networks, USENIX Association, 2010.
[14] "Klout," Klout Inc., 2014. [Online]. Available: http://klout.com/corp/how-it-works. [Accessed on 4 February 2014].
[15] G. Bogdan, A. Zygmunt and S. Podgorski, "Incorporating text into Evolution of Social Analysis," Computer Science and Information Systems (FedCSIS), pp. 931-938, 2013.
[16] J. Jang and S.-H. Myaeng, "Discovering Dedicators with Topic-based Semantic Social Networks," in Seventh International AAAI Conference on Weblogs and Social Media, 2013.
[17] J. Weng, E.-P. Lim, J. Jiang and Q. He, "TwitterRank: Finding Topic-sensitive Influential Twitterers," in Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, 2010.
[18] R. Dawkins, The Selfish Gene, Oxford University Press, 1976.
[19] G. Salton, A. Wong and C. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
[20] E. M. Rogers, Diffusion of Innovations, Free Press of Glencoe, Macmillan Company, 1962.
[21] L. Brönnimann, "Sentiment in Word Clouds," Windisch, 2013.
[22] IR Authorities, "ISO 639," 9 September 2013. [Online]. Available: http://www.iso.org/iso/home/standards/language_codes.htm. [Accessed on 2 February 2014].
[23] S. Nakatani, "Language Detection Library," Cybozu Labs, Inc., Tokyo, Japan, 2010.
[24] L. Brönnimann, "Analysis of communication from Swiss politicians on Twitter," Windisch, 2013.
[25] T. M. Fruchterman, "Graph Drawing by Force-Directed Placement," Software - Practice & Experience, vol. 21, no. 11, pp. 1129-1164, 1991.
[26] L. Brönnimann, "Swiss politicians on Twitter," August 2013. [Online]. Available: http://www.twitterpolitiker.ch. [Accessed on 4 February 201].
[27] H. Saif, Y. He and H. Alani, "Semantic sentiment analysis of Twitter," in The 11th International Semantic Web Conference (ISWC 2012), 11-15 November 2012.
[28] P. Gloor, M. Paasivaara, D. Schoder and P. Willems, "Finding collaborative innovation networks through correlating performance with social network structure," International Journal of Product Research, vol. 46, no. 5, pp. 1357-1371, 2008.
[29] T. Grossenbacher, R. Straumann, F. Zirin and T. Wider, "Some Polis - Social Media and Politics," 2013. [Online]. Available: http://www.somepolis.ch/. [Accessed on 4 February 2014].
[30] "Twitalyzer," 2013. [Online]. Available: http://www.twitalyzer.com/. [Accessed on 4 February 2014].
[31] Edelman, "Tweetlevel," 2012. [Online]. Available: http://tweetlevel.edelman.com/About.aspx. [Accessed on 4 February 2014].
[32] M. Cha, H. Haddadi, F. Benevenuto and K. P. Gummadi, "Measuring User Influence in Twitter: The Million Follower Fallacy," ICWSM, vol. 10, pp. 10-17, 2010.
[33] M. Bastian, S. Heymann and M. Jacomy, "Gephi: an open source software for exploring and manipulating networks," in International AAAI Conference on Weblogs and Social Media, 2009.
[34] M. Coscia, "Competition and Success in the Meme Pool: a Case Study on Quickmeme.com," Center for International Development, Harvard Kennedy School, 2013.
[35] P. N. Howard, A. Duffy, D. Freelon, M. Hussain, W. Mari and M. Mazaid, "Opening Closed Regimes: What Was the Role of Social Media During the Arab Spring?," PITPI, Seattle, 2011.

7. List of Figures

Figure 1 Illustration of a Twitter network in which all nodes are connected through the node in the middle
Figure 2 Representation of a Twitter network with and without actors that have no outgoing connections
Figure 3 Randomly selected politicians are often connected to each other through a few nodes
Figure 4 The e-mail network of a single person can have very different structures
Figure 5 In this e-mail network, project teams communicate primarily among themselves
Figure 6 In this representation, the color shows the betweenness centrality of the node. Dark blue nodes have the highest value, red nodes the lowest
Figure 7 Screenshot of the user interface of Condor
Figure 8 The default settings should automatically use the most likely field for analysis.
Figure 9 Menu to calculate the influence
Figure 10 The Fruchterman-Reingold algorithm provides a useful visualization of the graph
Figure 11 The same network as in the previous figure benefits from different display options
Figure 12 Details of a single node after analysis
Figure 13 Details of a single edge after analysis
Figure 14 The influential people in this network are quickly apparent
Figure 15 Error message that appears when no text analysis has taken place
Figure 16 Sentiment and activity diagram
Figure 17 Overview of the most important words of the network and their distribution
Figure 18 The parliamentarians in this network are colored according to their party affiliation; other Twitter accounts are drawn as points
Figure 19 Number of followers as a measure of the importance of politicians
Figure 20 Betweenness centrality as a measure of the importance of politicians
Figure 21 Influence as a measure of the importance of politicians
Figure 22 Two subnetworks can be seen in the picture, the smaller one with Spanish tweets and the larger one with English tweets.
Figure 23 The node size depends on the influence of the person. In this area of the network, Ocean_BMW seems to be very influential
Figure 24 The word cloud view scales the words with their frequency of occurrence in the data set. Green words are positive, red words are negative.
Figure 25 A calculation of the betweenness centrality provides little new information about the network
Figure 26 The Influence value shows that there are certain differences between the people of the project team.
Figure 27 The betweenness centrality in this network has only very limited explanatory power
Figure 28 Individuals with high influence are more central in the network.
Figure 29 The node size corresponds to the betweenness centrality of the nodes.
Figure 30 The node size corresponds to the influence of the node

8. Glossary

Betweenness Centrality
Betweenness centrality is a commonly used structural network metric that describes how central a node is in the network topology. It is measured by counting how often a node lies on the shortest paths between all other pairs of nodes in a network. The exact calculation can be found in chapter 2.2.1.

Cool Hunting
Cool Hunting is often referred to as trend hunting and consists of the professional search for new trends and trendsetters. So-called Coolhunters investigate, among other things, current youth culture and provide their insights and predictions about future developments to companies that try to gain a market advantage from them.

Hashtag
Tweets are often tagged with so-called hashtags, which consist of a pound sign (#) followed by a term or an abbreviation. The hashtag #1zu12, for example, indicates that a tweet is about the 1:12 initiative, which might not have been immediately obvious from the rest of the tweet.

Metrics
In the context of this work, a metric denotes a measurement system that measures quantifiable units according to a predetermined scale. Examples of metrics are the number of sent tweets or the number of followers.

Natural Language Processing
Natural Language Processing (NLP for short) is used to analyze text written by people. NLP is an umbrella term for algorithms and methods for text analysis, for example the automatic translation of texts or the determination of the language of a text. Sentiment analysis is also a subcategory of Natural Language Processing.

Retweet
The further distribution of a tweet by another person on Twitter is called a Retweet. Such a tweet is marked at the beginning of the message with the letters RT, followed by the user name of the original sender.

Sentiment
The sentiment is used to classify the content and the ideas of a sentence. A text can be, for example, positive, sad, happy, angry or neutral.
In this work, a distinction is made between positive, negative and neutral sentiment.

Sentiment analysis
Sentiment analysis attempts to determine the sentiment of a text. The polarity of a text is determined with machine-based methods, for which different algorithms can be used.

Stopword
Words that occur very often in a given language and thus carry little information are called stopwords. In English, examples are the article "the" or the word "and".

Tweet
A single message on Twitter is called a Tweet and has a maximum of 140 characters.

Tweeted
To tweet means to write a tweet and publish it on Twitter.

Word Cloud
A word cloud is often used to illustrate the most important concepts in a text or a collection of texts. It is an image consisting of the most frequent words in a text. The more frequently a word appears, the larger it is displayed in the word cloud.

9. Honesty Statement

I hereby declare that this project is the result of my own independent work. All sources used are given and citations are marked as such.

Date / Place
Lucas Brönnimann, MSE student

10. APPENDIX A

Here is the complete list of parliamentarians with a Twitter account, according to the analysis of section 3.3.3.

Table columns: Person, Influence, Betweenness, Followers, Rank: Influence, Rank: Betweenness, Rank: Followers

11. APPENDIX B

Here the students of COIN 2013 are listed with the complete evaluation.
The evaluation can be found in chapter 3.2.3.

Table columns: Name, Answers, Influence, Number of Messages

The following table refers to the analysis of COIN 2012.

Table columns: Name, Answers, Influence, Number of Messages