Analysis Of The Diffusion Of Innovations In Social Networks
Master Thesis
By
Lucas Brönnimann
Brugg-Windisch FHNW
07/02/2014
Note: Original manuscript was in German and translated with
http://translate.google.com/#de/en/ and edited by
Ken Riopelle
Social Network Analysis
ABSTRACT In social networks, new ideas and thoughts can spread quickly. A diffusion
analysis identifies which topics propagate through the network, the speed with
which this happens, and the people who are particularly successful at
communicating their ideas. This work is particularly concerned with the last point
and seeks to find the influential people who efficiently pass on their behavior,
terms and concepts to their immediate communication network.
With the use of multilingual text analysis methods, a diverse set of
communication networks is investigated and evaluated with a proposed set of
content measures derived from the language and behavior of the people who
communicate with each other. The goal is to detect the most important
or influential actors. The term "influence" is defined as the effective impact of a
person on others within an observed communication network.
To validate the content-derived Influence measure, four different communication
network data sets are evaluated, which include e-mail communications of
individual actors, project teams and Twitter sub-networks. The results indicate
that the operationalized definition of Influence derived from text analysis is likely
to find important people within a communication network, even in situations in
which conventional methods fail.
Contents
Abstract
1. Introduction
1.1. About this document
1.2. Background
1.3. Objective
1.4. Software Development
2. Analysis of Communication
2.1. Structure of communication networks
2.1.1. Twitter
2.1.2. E-mail networks
2.1.3. Additional Networks
2.2. Network Analysis
2.2.1. Centrality
2.2.2. Other metrics
2.3. Text Analysis
2.3.1. Metrics for the analysis of text
2.3.2. Advanced Text Analysis
2.4. Influence and Innovation
2.4.1. Definition of influence
2.4.2. Changes in a person’s word histogram
2.4.3. Influence of individual messages
2.4.4. Influence - combination of: similarity, relevance and time delay
2.4.5. Example
2.4.6. Tracing Innovation through text over time
3. Implementation and Evaluation
3.1. Implementation in Condor
3.1.1. Functionality of Condor
3.1.2. Tools for Analysis of the network
3.1.3. Visualization of a Network
3.1.4. Detailed nodes and edges information
3.1.5. Diagrams
3.2. Test Data: Overview and Treatment
3.2.1. Twitter: Politics
3.2.2. Twitter: BMW
3.2.3. E-mail: COIN Seminar
3.2.4. E-mail: Private E-mail accounts
3.3. Evaluations
3.3.1. Twitter: Politics
3.3.2. Twitter: BMW
3.3.3. E-mail: COIN Seminar
3.3.4. E-mail: Private E-mail accounts
3.4. Discussion
3.4.1. Properties of influential actors
3.4.2. Properties of influential messages
3.5. Influence Advantages and Other Applications
4. Conclusion
5. Recommendations for Future Work
6. Bibliography
7. List of Figures
8. Glossary
9. Honesty Statement
10. Appendix A
11. Appendix B
1. Introduction
1.1. About this document
This document describes several methods for the analysis of communication in
social networks. The first part defines network terms and lays out the
theoretical basis for the analysis of networks. Based on this, a system is
described which examines the diffusion of new ideas, concepts and innovations
in order to identify the most relevant or influential actors in this process.
Besides the theoretical description of the system's development and function, a
number of enhancements for the software "Condor" are proposed and tested in
the analysis of networks. The second part of the document includes four test
cases using e-mail and Twitter communication datasets, which are used to
demonstrate and evaluate the accuracy and utility of the text-derived Influence
measure.
Insights from the development and application of these test cases are documented
and discussed in more detail. The conclusion offers suggestions for future
work.
1.2. Background
The analysis of social networks is used to identify current trends, to find important
people in the network, and to observe the distribution of information. This is also
known as cool hunting. [1] Previous studies are often based in large part on the
analysis of the structure of the corresponding network and use only well-known
network algorithms for finding central persons, such as betweenness
centrality [2] or Google's PageRank [3].
Other works deal with the analysis of the communication itself and use different
techniques from Natural Language Processing. For example, the content of messages
can be studied with the help of Sentiment Analysis to assess how positively or
negatively actors communicate in a network. Here, the structure of the network is
given little or no consideration.
Both of these approaches are implemented in the current version 3.0.a8 of the
Condor software. [4] This software consists of a number of modules to retrieve
data from multiple sources and the subsequent analysis and visualization of
these networks. However, the text analysis is limited to the analysis of
sentiment and the finding of important keywords. Consequently, the software cannot
determine how topics spread in networks or identify the key influential people.
1.3. Objective
The flow of information among actors in social networks is to be examined with
the aid of text analysis methods. To this end, appropriate metrics are defined
which describe the flow of information and the influence that an individual actor
has on it. In particular, a metric for measuring the influence of
an actor within its network is the primary objective of this study and will be
reviewed for its quality and validity.
In order to measure the influence that the language of one actor has on others, it
is important to analyze language over time. Several methods of text analysis
will be used to determine whether a message from A to B has an influence on
the future messages sent by B. In addition, other text metrics will be examined,
such as the average complexity of the vocabulary used and the sentiment of the
text (its positivity or negativity).
The test data consist of networks collected from Twitter, as well as e-mail networks
of single individuals and project teams. Furthermore, the factors that affect the new
influence metric are discussed along with its inherent limitations. These can be
used to determine what types of message have a high chance of influencing
other actors in the network.
1.4. Software Development
The Galaxy Advisor’s software, Condor, is used to analyze social networks.
Additional functionality was added which calculates, among other things, the
influence of actors on their surrounding network, including its visualization.
Specifically, a text content processing module (see Section 3.1.2, Tools for
Analysis of the network) was developed for this purpose and incorporated into
the software.
These enhancements are documented in accordance with the programming
policies of Condor and checked by means of JUnit tests. All of the new features
can be used productively in the current version of Condor 3.
Table 1 shows the values that Condor can calculate through analysis of the
nodes and edges in a communication network. The four data elements written in
red italics have been added by this work.
Nodes
Edges (with text)
Degree Centrality
Text length
Closeness Centrality
Sentiment
Betweenness Centrality
Emotionality
Contribution Index
Response time
Influence
Table 1 Available metrics in communication networks
2. Analysis of Communication
This chapter focuses on the theoretical aspect of the work. It consists of an
introduction to the structure and definitions of communication networks and
describes conventional and new possibilities for textual content analysis. This is
followed by an in-depth description of a method for measuring the influence of
individual actors and for tracing their impact on the diffusion of innovations in
communication networks over time. The listed methods of analysis for actors and
messages are those that are implemented in Condor 3 or were implemented
during the course of this work.
2.1. Structure of communication networks
In order to observe the diffusion of innovation and to measure the influence of
individuals within the network, some network structures are clearly preferable
to others. An ideal data network meets the following conditions:
• Access to all data in the network is available
• All external factors are known and controllable
• All relevant communication between the actors in the observable network is
available
• No irrelevant or useless data (spam) are included
While such ideal networks do not exist in reality, there are several that come
relatively close. In particular, e-mail networks in academic and business
environments are worth considering for this purpose. Ideally, a study would have
complete e-mail access to one or more project teams over a long period of time.
Of course, collecting such e-mail data requires the consent of all parties,
which is not always practical.
Another very suitable network for analysis is Twitter, because the data are very
easily accessible. This includes messages addressed directly to another user,
which are simply marked with an “@” followed by the Twitter name and can be
viewed by anyone.
Facebook is a possible data source, but it is not used in this study because its
restrictive privacy settings prevent much of the communication between friends
from being included.
Likewise, other networks suffer from the problem that the data are difficult to
access and often reflect only a portion or small sample of the relevant
communication. Nevertheless, Condor’s "Web Fetcher" does make it possible to
collect data from smaller communities, if they are made public on websites, blogs
or Wikipedia articles.
2.1.1. Twitter
Twitter is one of the most easily observable networks because of its API
(Application Programming Interface). In practice, a Twitter network can be
observed when it is bounded by a search term or by a follower network; it is
impractical to analyze all Twitter activity at once.
Twitter networks can be represented in different ways. In general, a graph is
created in which all observed Twitter accounts are mapped to nodes and the tweets
are represented as edges. This is the case with Condor 3; the software is
described in greater detail in Section 3.1.1.
A Twitter search on a term results in a set of tweets. The authors of these tweets
become the nodes, and the edges between people are obtained from so-called
@Mentions, symbolized by an “@” sign followed by the desired user name within
the tweet. Retweets are also mapped, by creating an edge between the author
of the original tweet and the person who retweeted it.
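The mapping from tweets to edges can be illustrated with a minimal Python sketch. It is not Condor's implementation (Condor is written in Java). The dictionary keys "author", "text" and "retweet_of", the function name tweet_edges, and the choice to direct a retweet edge from the original author to the retweeter are all assumptions made for illustration.

```python
import re
from collections import Counter

# An @Mention is an "@" sign followed by a user name.
MENTION = re.compile(r"@(\w+)")


def tweet_edges(tweets):
    """Return a Counter of directed (source, target) edges.

    Each tweet is a dict with the hypothetical keys "author", "text" and,
    for retweets only, "retweet_of" (the author of the original tweet).
    """
    edges = Counter()
    for tweet in tweets:
        author = tweet["author"]
        original = tweet.get("retweet_of")
        if original:
            # Assumption: the edge runs from the original author to the
            # retweeter; mentions inside a retweet are not counted again.
            edges[(original, author)] += 1
            continue
        for name in MENTION.findall(tweet["text"]):
            if name != author:  # ignore self-mentions
                edges[(author, name)] += 1
    return edges


example = [
    {"author": "anna", "text": "@ben have you seen this?"},
    {"author": "carl", "text": "RT @anna: big news", "retweet_of": "anna"},
    {"author": "dora", "text": "no mentions here"},
]
print(tweet_edges(example))
```

Authors such as "dora", who neither mention nor are mentioned, produce no edge and would appear as isolated nodes in the visualization.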
Depending on the Condor Twitter Fetcher options selected, there are differences
in the structure of the resulting network visualization. Figure 1 shows a network in
which the author of each tweet is linked to the search term, which yields a
star-shaped structure with the search-term node in the center.
Figure 1 Illustration of a Twitter network, where all tweets are connected in the center to the
search term.
In contrast, Figure 2 shows two visualizations in which the tweets are not
connected to the search term. The visualization on the right shows the same
network, but without the unconnected nodes. It is apparent that these
visualizations have a very loose network structure without much communication
among the users.
Figure 2 Twitter visualizations where the tweets are not connected to the search term.
As an alternative, if a person’s tweets are collected over time, including both
retweets and tweets with an @Mention of others, the result is a denser
network that is more amenable to the kind of text analysis of interest in this study.
Figure 3 shows an example of such a network, for which a random selection of
Swiss politicians and their last 100 tweets were collected. This denser network
structure is much better suited to the purpose of this work, which is
finding important people within a network. The ability to create such a network is
currently not available in Condor’s production version, but requires only a
relatively small number of code changes.
Figure 3 A random selection of Swiss politicians and their last 100 tweets forming a denser
network
2.1.2. E-mail networks
E-mail networks are quite ubiquitous, but appropriate access is often difficult to
obtain, except in laboratory situations. Nevertheless, there are ways to obtain
e-mail data for analysis.
For example, it is possible to analyze one's own mailbox. Figure 4 shows a
network created by the analysis of a single mailbox of a student. Each
node represents an e-mail address, and each e-mail sent
creates an edge to every receiver, even one who was only in CC. Note: persons
with multiple e-mail addresses can occur several times in different places in the
network, unless their addresses are merged into a single e-mail address.
Figure 4 The e-mail network of a single person can have very different structures.
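As a rough illustration of this edge construction, the following Python sketch creates one edge per receiver, including CC recipients, and optionally merges several addresses of the same person. The function name email_edges, the aliases mapping and the lower-casing of addresses are assumptions for this example and do not describe Condor's actual data model.

```python
def email_edges(sender, to=(), cc=(), aliases=None):
    """Return one (sender, receiver) edge per distinct receiver of an e-mail.

    aliases is a hypothetical mapping from a lower-case secondary address
    to the address it should be merged into.
    """
    aliases = aliases or {}

    def canonical(address):
        # Assumption: addresses are compared case-insensitively.
        address = address.lower()
        return aliases.get(address, address)

    source = canonical(sender)
    # dict.fromkeys removes duplicates while keeping the original order.
    targets = dict.fromkeys(canonical(a) for a in list(to) + list(cc))
    return [(source, target) for target in targets if target != source]


print(email_edges("a@fhnw.ch", to=["b@fhnw.ch"], cc=["c@fhnw.ch", "b@fhnw.ch"]))
```

Without an aliases mapping, a person who writes from two addresses appears as two separate nodes, which is the effect noted above.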
In cases where the appropriate privacy agreements have been implemented, it is
possible to analyze the e-mail networks of project teams. An example is shown in
Figure 5, which represents the e-mail communication of 11 project teams with an
average of 5 people per team. In this case, a dummy e-mail address was created
for each team, and team members copied all their email to this dummy e-mail
address during the course of their project.
Figure 5 Example e-mail network of 11 project teams
A weakness in this data collection process is the very real possibility that a large
amount of communication among team members does not occur within the e-mail
network, because members use other channels of communication, such as
face-to-face meetings, conference calls, or text messages.
Despite the limitations of e-mail networks, they have the potential for very deep
insight into the communication of the respective employees, because people have
often saved an e-mail archive for a project. Using e-mail archives makes it
possible to observe longer periods of time, to visualize the development of
collaborations between people and teams, and to trace new or important topics by
measuring their speed and diffusion throughout the network.
2.1.3. Additional Networks
To explore the development of new issues, blogs and Wikipedia articles are also
possible data sources. In such a network, the nodes correspond to
the web pages of the blogs or the Wikipedia articles, and the edges are the links
between them. The problem with blogs and Wikipedia articles is the
difficulty of determining where the information originated and exactly how the
information flows. Thus, blogs and Wikipedia data are good sources for the
discovery of important issues and can be analyzed with sentiment analysis, but
they are inherently problematic for assessing the dissemination of information
over time.
2.2. Network Analysis
Network algorithms enable the calculation of various metrics, such as network
centrality measures, which can be applied to all the actors. It is not necessary to
examine the content of the communication for these calculations. In this section,
the following network structure measures are described in more detail:
• Degree Centrality
• Closeness Centrality
• Betweenness Centrality
• Activity
• Contribution Index
2.2.1. Centrality
For an actor, or node, of a network, different measures of centrality can be
calculated. Important measures include Degree Centrality, Betweenness Centrality
and Closeness Centrality, each of which assigns a centrality value to every node
of a graph with V nodes.
Degree Centrality
This very simple measure is defined as the number of edges connected to a
node. It describes the number of actors that have a direct connection with a node.
[5]
Closeness Centrality
The closeness of a node is calculated from the sum of its distances d to all other
nodes. The Closeness Centrality of a node is defined as the inverse thereof. Very
central nodes thus have a higher Closeness Centrality value and are often
connected to other nodes with similarly high values. [6]
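The two definitions above can be sketched in a few lines of Python. The sketch assumes an undirected, unweighted graph stored as a dictionary of neighbor sets, and it sums distances only over the nodes that can actually be reached. Both are assumptions made for illustration; the text above does not prescribe how unreachable nodes are treated, and the function names are hypothetical.

```python
from collections import deque


def degree_centrality(adj, node):
    """Number of edges connected to the node."""
    return len(adj[node])


def closeness_centrality(adj, node):
    """Inverse of the sum of shortest-path distances to all reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search yields shortest distances
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    # Assumption: an isolated node gets the value 0.
    return 1.0 / total if total > 0 else 0.0


# A chain a - b - c: the middle node is the most central one.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
for name in chain:
    print(name, degree_centrality(chain, name), closeness_centrality(chain, name))
```

In the chain, node b reaches both neighbors in one step (sum of distances 2), while a and c each need three steps in total, so b receives the highest closeness value.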
Betweenness Centrality
The calculation of betweenness centrality is based on the shortest paths between
all the nodes of the network. The frequency with which a node is on the shortest
path between two other nodes is the decisive factor. [2]
The calculation is performed according to the following steps:
1. For each pair of nodes (s, t), calculate the shortest paths between them
2. For each pair of nodes (s, t), calculate the proportion of the shortest paths
that pass through the node v
3. Sum this fraction over all pairs of nodes (s, t)
This can be represented as a formula:
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where $\sigma_{st}$ is the number of shortest paths from node $s$ to node $t$ and
$\sigma_{st}(v)$ is the number of those paths that pass through $v$.
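The three steps can be followed literally in a short Python sketch. It assumes an undirected, unweighted graph given as a dictionary of neighbor sets and counts each unordered pair (s, t) once. The function names are hypothetical, and the pairwise approach is chosen for readability; it is not suited to large networks and is not claimed to be how Condor computes the value.

```python
from collections import deque
from itertools import combinations


def shortest_path_counts(adj, source):
    """BFS from source: distance and number of shortest paths per reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:           # first visit fixes the distance
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness_centrality(adj):
    info = {v: shortest_path_counts(adj, v) for v in adj}  # step 1
    result = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):   # each unordered pair once
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue                    # s and t are not connected
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path exactly when the distances add up.
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                # steps 2 and 3: add the fraction of shortest paths through v
                result[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return result


# Square a - b - d - c - a: every opposite pair has two shortest paths.
square = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
print(betweenness_centrality(square))
```

In the square, each node lies on one of the two shortest paths between its two neighbors and therefore receives the value 0.5.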
Betweenness Centrality is especially useful in analyzing social networks,
because the nodes with a high value are often the actors that communicate
across project teams. Thus, actors with a high Betweenness Centrality value play
a key role in the dissemination of information. Figure 6 shows an example of a
network in which the color of a node indicates its betweenness centrality.
The dark blue nodes in the center have the highest betweenness centrality value,
and the red nodes on the outside have the lowest.
Figure 6 A node’s Betweenness Centrality, with the dark blue nodes having the
highest value and the red nodes the lowest. (Figure created by Claudio Rocchini,
23 April 2007.)
2.2.2. Other metrics
Activity
Activity is simply the total number of messages sent by an actor. When
comparing different time periods, it is reasonable to normalize this count and
calculate the average number of messages per day.
Contribution Index
The Contribution Index describes the balance between an actor’s sent and
received messages and is another measure of network structure. [7] It is
calculated from the ratio between the sent and received messages according to
the following formula:
$$\text{Contribution Index} = \frac{\text{messages sent} - \text{messages received}}{\text{messages sent} + \text{messages received}}$$
This produces values between -1 and +1, where -1 denotes a person who only
receives messages and does not send any, and +1 denotes the reverse. In an
ideal network, each actor sends about as many messages as they receive, but in
practice this can vary greatly from person to person.
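Both metrics of this section reduce to simple arithmetic, as the following Python sketch shows. It assumes the usual form of the Contribution Index, (sent - received) / (sent + received), which matches the value range described above. The function names and the value 0 for an actor without any messages are assumptions made for illustration.

```python
def activity_per_day(messages_sent, days):
    """Activity normalized to the average number of messages per day."""
    if days <= 0:
        raise ValueError("the observation period must cover at least one day")
    return messages_sent / days


def contribution_index(sent, received):
    """Balance of sent and received messages, between -1 and +1."""
    total = sent + received
    if total == 0:
        return 0.0  # assumption: no messages at all counts as balanced
    return (sent - received) / total


print(activity_per_day(42, 14))   # 3.0 messages per day
print(contribution_index(10, 0))  # 1.0: only sends
print(contribution_index(0, 10))  # -1.0: only receives
print(contribution_index(3, 1))   # 0.5: sends more than receives
```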
2.3. Text Analysis
In addition to metrics that focus on the actors or nodes of a network, metrics can
be computed for the edges, that is, for the content of the communication that
connects the nodes. For a text, or a parsed edge, the
following measures can be computed:
• Number of recipients
• Length
• Sentiment
• Emotionality
• Complexity
• Response time
• Influence
For each actor or node, the average value of these edge data can be determined,
and in certain cases, the sum of values.
2.3.1. Metrics for the analysis of text
For a single text, some values are relatively simple to calculate; others are
more difficult. Of interest in this study are the following text metrics: length,
sentiment, emotionality and complexity. Note: for performance reasons, when a
message has multiple recipients it makes sense to calculate a measure only once
for the message and copy the result to the other recipients.
Number of recipients
The number of recipients is not a metric of the text itself, but may be crucial for
various comparisons. It is quite possible that some people do not use the same
communication style when they talk privately with an individual or when they talk
to several people at once.
Length
Text length is the number of characters of a message. On Twitter, a message is
limited to a maximum of 140 characters.
Sentiment
Sentiment is a value between 0 and 1, where a high value indicates a positive and
a low value a negative sentiment. It is calculated by a specially developed
multi-lingual classifier based on a machine learning method with data from
Twitter. The classifier can handle texts in German, English, French, Spanish and
Italian and was trained using a total of
about 200 million tweets with negative and positive emoticons. The exact way it
works is detailed in the paper "Multi-Language sentiment analysis of Twitter
data on the example of Swiss politicians." [8] The principle is based on an idea
by Patrick de Boer and other research studies. [9] [10]
Emotionality
Because sentiment is calculated as the average of the whole text, sometimes
information can be lost. For example, the following messages are both classified
as approximately neutral:
"I bought a smartphone today."
"I love my new phone sooo much! The old one was terrible."
In the second text, the positive and negative statements are roughly in balance,
but the text is much more emotional. Such texts can be distinguished from one
another by the value of emotionality. The Emotionality value is calculated as
follows:
Complexity
The complexity of a text can be measured in several ways. Often, sentence length
plays a crucial role in calculating the complexity of a text, but
this approach creates a problem when multiple languages are involved. A solution is
to use the $D_{Twitter}$ training data set already used for the calculation of
sentiment and emotionality.
This source contains information about how often individual words in the
different languages were used on Twitter in combination with positive or negative
emoticons. For example, the word "Internet" was used 1,182 times in a total of
1,600,000 German tweets.
From this it is possible to estimate roughly the probability $p(W_k)$ that a single
word $W_k$ appears in a random tweet. The logarithm of the inverse of this
probability, $\log\left(1/p(W_k)\right)$, equals the IDF (Inverse Document
Frequency) value, which is well suited to describing the rarity of terms in a
corpus of documents.
The higher the average of the IDF values over all words of the text to be tested,
the more complex the text. For texts that contain more words which occur
frequently on Twitter, the value is correspondingly smaller.
The average tweet in German has a value of approximately 7.1, with a
variance of 1.37.
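The averaging of IDF values can be sketched in Python as follows. The sketch assumes the natural logarithm, since the base is not stated above, and it simply skips words that are missing from the frequency table. The function names, the table layout and the count for the word "und" are illustrative assumptions; only the count for "Internet" is taken from the text.

```python
import math


def idf(count, total_tweets):
    """log(1 / p) with p estimated as count / total_tweets (natural log assumed)."""
    return math.log(total_tweets / count)


def complexity(words, counts, total_tweets):
    """Average IDF over all words of a text that occur in the frequency table."""
    values = [idf(counts[w], total_tweets) for w in words if counts.get(w)]
    # Assumption: a text without any known word gets the value 0.
    return sum(values) / len(values) if values else 0.0


TOTAL = 1_600_000
# "internet" uses the count quoted above; the count for "und" is made up.
table = {"internet": 1182, "und": 400_000}

print(round(idf(table["internet"], TOTAL), 2))
print(complexity(["und", "internet"], table, TOTAL))
```

Under the definition log(1/p), rarer words receive larger IDF values, so a text consisting only of rare words averages higher than one that mixes in frequent words.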
Response time
This value is specifically intended for e-mail networks. It measures the average
time between the sending of a message and the receiver's response to it.
A similar value can be calculated for "turn taking", which is the average number of
messages a sender answers. Note: the “turn taking” metric was not implemented
in Condor 3 at the time of this study and thus is not considered further in this
report.
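One possible reading of the response time can be sketched in Python: for a given sender and receiver, the waiting time runs from the earliest unanswered message of the sender to the next message the receiver sends back. This pairing rule, the tuple layout (timestamp, from, to) and the function name are assumptions made for illustration; the exact rule used in Condor is not described here.

```python
from datetime import datetime, timedelta


def average_response_time(messages, sender, receiver):
    """Average delay between a message sender -> receiver and the next reply.

    messages is an iterable of (timestamp, from, to) tuples.
    Returns None if the receiver never replied.
    """
    waits = []
    pending = None  # time of the earliest message that is still unanswered
    for timestamp, source, target in sorted(messages):
        if source == sender and target == receiver and pending is None:
            pending = timestamp
        elif source == receiver and target == sender and pending is not None:
            waits.append(timestamp - pending)
            pending = None
    if not waits:
        return None
    return sum(waits, timedelta()) / len(waits)


log = [
    (datetime(2013, 5, 3, 10, 0), "A", "B"),
    (datetime(2013, 5, 3, 10, 30), "B", "A"),
    (datetime(2013, 5, 3, 12, 0), "A", "B"),
    (datetime(2013, 5, 3, 13, 30), "B", "A"),
]
print(average_response_time(log, "A", "B"))
```

In this example B answers after 30 and after 90 minutes, which gives an average response time of one hour.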
Influence
Influence attempts to classify the importance of a message to each actor within a
network. Influence takes into account whether a receiver reacts in any form after
receiving a sender’s message, or changes his or her behavior. For example,
influence is registered when a recipient uses a new word for the first time just
after receiving a message. A retweet or a forwarded message likewise counts as a
"successful" influential message for the original sender. The stronger
or faster the recipients' reaction, the higher the influence value for the sender.
The influence measure and its exact calculation are described in more detail in
Section 2.4.
2.3.2. Advanced Text Analysis
Depending on the network data, additional steps may be necessary to clean the
text data for analysis. For example, it may be necessary to remove
HTML/formatting as well as remove email chain messages to confine the
analysis to the most recent or last message.
Remove HTML / formatting
In this work, HTML is removed using the Java HTML parser Jsoup.
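The effect of this cleaning step can be illustrated without Jsoup. The following Python sketch uses the standard library module html.parser as a stand-in. It is not the Jsoup-based code of this work, and the class name, the function name and the list of block-level tags are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags after which a word boundary is assumed (illustrative selection).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text content with tags removed and whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>Yes, <b>Friday</b> suits me fine.</p>"))
```

Inline tags such as <b> are dropped without inserting a space, so words that are only partly formatted stay intact.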
Recognizing the latest message in e-mail communication
Most e-mails contain not only the latest message but also the chain of
previous communication between the recipients. Thus, for example, an
edge from person A to person B may contain the following text, including the
previous communication:
Yes, Friday suits me fine.
On 03.05.2013, at 11:33, Person B wrote:
Do you have time on Friday for a meeting?
The text of Person A consists of only five words (“Yes, Friday suits me fine.”), but
the previous correspondence is saved as well. Thus, it is first necessary to separate
A’s message from B’s, along with the information as to when and by whom the
original e-mail was sent.
The problem is that each mail program has its own formatting conventions for
separating the new message from the old one. A perfect detection rate is thus very
difficult to achieve and requires correspondingly good training data. Manually
writing a regular expression (regex, see footnote 3) for every conceivable variant
ensures the greatest possible detection rate and can cover over 95 % of cases. The
following are some examples in which detection is possible with this system:
•
New content is here
On 03.05.2013, at 11:33, Person B wrote:
Original message is here
•
New content is here
On Wed, November 6, 2013 17:57:06 CET wrote Person B
Original message is here
•
New content is here
From: Person B
Posted:
Original message is here
•
New content is here
From: Person B [mailto: b.person@fhnw.ch]
Sent:
Original message is here
•
New content is here
---- Original Message ----
Original message is here
In each of these examples, a regex that matches the separator line makes it
possible to parse the data and select just the most recent text (here, “New
content is here”) for analysis. A similar regex step can also remove typical
signatures such as "sent from my iPhone" and the like, which add no value to the
Influence text analysis.
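The separation step can be sketched as follows. This is a minimal illustration in Python, not the Java implementation used in this work; the patterns in REPLY_SEPARATORS and the function name latest_message are assumptions modeled only on the example formats listed above, and a production list would need many more variants.

```python
import re

# Hypothetical separator patterns, one per example format shown above.
REPLY_SEPARATORS = [
    re.compile(r"^On .+ wrote:\s*$", re.MULTILINE),   # On 03.05.2013, at 11:33, Person B wrote:
    re.compile(r"^On .+ wrote .+$", re.MULTILINE),    # On Wed, November 6, 2013 ... wrote Person B
    re.compile(r"^From: .+$", re.MULTILINE),          # From: Person B [mailto: ...]
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$",
               re.MULTILINE | re.IGNORECASE),         # ---- Original Message ----
]
# Hypothetical signature pattern ("sent from my iPhone" and similar).
SIGNATURE = re.compile(r"^sent from my \w+.*$", re.MULTILINE | re.IGNORECASE)


def latest_message(body: str) -> str:
    """Return only the newest part of an e-mail body."""
    cut = len(body)
    for pattern in REPLY_SEPARATORS:
        match = pattern.search(body)
        if match:
            # Keep the earliest separator found by any pattern.
            cut = min(cut, match.start())
    newest = SIGNATURE.sub("", body[:cut])
    return newest.strip()


mail = ("Yes, Friday suits me fine.\n"
        "On 03.05.2013, at 11:33, Person B wrote:\n"
        "Do you have time on Friday for a meeting?")
print(latest_message(mail))
```

Patterns this simple also match new content that happens to start a line with "On ... wrote:", which is one reason why a perfect detection rate is hard to reach.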
3
A regular expression (abbreviated regex or regexp) is a sequence of characters that
forms a search pattern, mainly for use in pattern matching with strings, or string
matching, i.e. "find and replace"-like operations. See:
http://en.wikipedia.org/wiki/Regular_expression
Treatment of hyperlinks
Hyperlinks can be relatively safely removed with simple regular expressions. In
most cases, it is advisable to completely remove hyperlinks from the text, unless
it is of interest to trace the exact distribution of these links in the network.
The following fictional Tweet serves as an example in which it is better not to
take the hyperlink into consideration, at least for the calculation of sentiment
and emotion:
This is the stupidest thing I’ve ever read! www.xyz.ch/gold-makeshappy.html
The text is fundamentally negative, as the author sneers at the statement in the
linked article. The words in the hyperlink (“gold makes happy”) are much more
positive, and a calculation of the Sentiment that included them would result in an
approximately neutral or even positive text.
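A simple regular expression of this kind might look as follows. The sketch is in Python rather than the Java used in this work, and the pattern URL and the function name remove_hyperlinks are assumptions for illustration; the pattern only covers links that start with "http://", "https://" or "www.".

```python
import re

# Assumed pattern: a link starts with a scheme or "www." and runs to the next whitespace.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def remove_hyperlinks(text: str) -> str:
    """Delete hyperlinks and collapse the whitespace they leave behind."""
    without_links = URL.sub(" ", text)
    return re.sub(r"\s+", " ", without_links).strip()


tweet = "This is the stupidest thing I've ever read! www.xyz.ch/gold-makeshappy.html"
print(remove_hyperlinks(tweet))
```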
Identifying Keywords
For the calculation of the Influence value and for different visualizations, it makes
sense to identify the most important words, or keywords. There are various
possibilities, two of which are described as follows:
Identifying keywords with regular expressions
It is relatively simple to split the text at all non-word characters with a
regex. The corresponding regex is ideally defined from the set of all
characters contained in the observed text, and words of three letters or
less are discarded.⁴
Another method is to construct a “stop-word” list of those words deemed
unimportant, such as the articles, “a”, “the” and “an”, to be automatically
removed or excluded from the text analysis, which results in a relatively
uncluttered keyword list.
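Both regex-based steps can be combined in a few lines. The following Python sketch is illustrative only: the stop-word list STOP_WORDS is a tiny assumed sample, MIN_LENGTH encodes the "three letters or less" rule from the text, and the function name keywords is hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real list is language-specific and much longer.
STOP_WORDS = {"a", "an", "the", "and", "have", "with", "this", "that"}
MIN_LENGTH = 4  # words of three letters or fewer are dropped


def keywords(text: str) -> Counter:
    """Count the remaining words after length and stop-word filtering."""
    # The pattern matches runs of letters only, so umlauts stay inside words
    # while digits, underscores and punctuation act as separators.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words
                   if len(w) >= MIN_LENGTH and w not in STOP_WORDS)


print(keywords("I have a problem. The problem is big."))
```

The resulting word-to-frequency map has the same shape as the Keywords field described later for Condor, which is why it is a convenient intermediate format.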
Identifying keywords with OpenNLP
A slightly more elaborate method uses the Java Library OpenNLP. [11] In
this way it is possible to first divide text into sentences and then to
subdivide these again to identify keywords. Instead of a stop word list, it
makes sense to let OpenNLP determine the nouns in the text and use only
these as the keywords.
4
However, a separation at each \s (white space) or at any \W (non-word character)
is often too imprecise and should be further refined.
Normalization and stemming
The collected keywords often occur in different grammatical forms and with
different spellings. So that this does not cause problems, the words can be
simplified in advance through the processes of normalization and stemming.
Normalization is the process in which the umlauts ÄÖÜ are replaced by the base
vowel with an appended “e” (for example, Ä becomes Ae). In addition, stemming
can be applied, which reduces words to their root. For example, the word
"categories" can be shortened to "categori". However, a stemmer for the
appropriate language needs to be used, and it should be noted that stemming
does not impact visualization.
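The normalization step can be sketched as a simple character translation. This Python example is an illustration under assumptions: the table UMLAUTS and the function name normalize are hypothetical, lowercasing is added here for convenience, and stemming is left out because it requires a language-specific stemmer from an external library.

```python
# Assumed replacement table: each umlaut becomes its base vowel plus "e".
UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def normalize(word: str) -> str:
    """Replace umlauts and lowercase the word so that spelling variants coincide."""
    return word.translate(UMLAUTS).lower()


print(normalize("Lösung"), normalize("Loesung"))
```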
2.4. Influence and Innovation
After all the preparatory work, the crucial task can be approached: determining
which messages are really important. This information is used to measure the
influence of individual actors on the overall network. First, "influence" needs a
more precise definition. The exact definition is then described mathematically
and explained with an example.
2.4.1. Definition of influence
Different applications use different definitions for the term "influence". Here are a
few examples.
In Facebook networks, it is possible, for example, to analyze the distribution of
links, or even of Facebook games. For example, a person counts as influenced when she
starts to play a game after another person has sent the automated
advertising for the game to their network of friends. [12] [13]
The so-called Klout Score calculates influence based on the number of followers,
frequency of retweets and some other factors. [14] Unfortunately, the exact
calculation is proprietary, and thus cannot be compared with other values.
People can also be influenced to join a network [15], or to write about certain
subjects. [16] The concept of homophily, where persons communicate with those
who are similar to themselves, has been found to exert influence. However, influential
people can break this barrier, so that, for example, despite large personal
differences they follow very heterogeneous people on Twitter. [17]
Despite these various definitions of influence, each is trying to measure whether
a person can cause a certain behavior change in their environment. Often, this
behavior is directly visible in the communication network, for example in the form
of retweets, new discussion topics or changes in the structure of the network.
Observable behavior changes can be found in the language used by people of
the communication network, which changes over time. An influential person is
able to introduce new ideas, beliefs, and behavior patterns, and this can be
summarized with the term "meme", which was introduced by Richard Dawkins.
[18] These memes become visible in messages within the communication
network.
Thus, Influence here is defined as the number of new terms, concepts, and ideas
which a person has introduced into the network and which are subsequently
used by other members of the network. The following sections describe how to
measure a person’s influence within networks of communication.
2.4.2. Changes in a person’s word histogram
A first approach for the calculation of Influence consists of the analysis of the
average language use for each actor within a predefined period. Thus, a word
histogram of the most commonly used words is saved as a vector for each actor
or node.
If a person writes new messages, this vector changes continuously. It is now
possible to check at fixed time intervals how the vector of each actor has
changed and to measure the difference. If the difference for an actor X is known,
then it can be compared with the vectors of all the other actors from whom actor X
previously received messages. If actor X, for example, was influenced by actor Y,
it is possible that X now starts to use words which can be attributed to Y’s
previous messages.
Thus, a measure can be derived for the Influence that actor Y had on actor X.
This can be calculated from the change in the angle between the two vectors. It
should be noted that this procedure should not consider any stop words. It can
also be useful to analyze the words that are in the messages from Y to X.
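The change in angle can be illustrated with word histograms stored as counters. The following Python sketch is an assumption about how such a comparison might be set up, not the procedure implemented in this work; the function names cosine, angle and influence_on are hypothetical, and the histograms are expected to be free of stop words already.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-histogram vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def angle(a: Counter, b: Counter) -> float:
    """Angle in radians; the clamp guards against rounding just outside [-1, 1]."""
    return math.acos(max(-1.0, min(1.0, cosine(a, b))))


def influence_on(x_before: Counter, x_after: Counter, y_sent: Counter) -> float:
    """Decrease of the angle between X and Y; positive if X moved toward Y."""
    return angle(x_before, y_sent) - angle(x_after, y_sent)


x_before = Counter({"meeting": 3})
x_after = Counter({"meeting": 3, "solution": 2})
y_sent = Counter({"solution": 4})
print(influence_on(x_before, x_after, y_sent))
```

A positive value only shows that X's vocabulary moved toward Y's; as the next paragraph explains, it cannot show who influenced whom when both actors interact often.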
A major disadvantage of this method is that various random factors may distort
the Influence measure. The language use of an individual is constantly changing,
and often depends upon human interaction. Particularly problematic is the case in
which two people interact often, because with this procedure it is unclear who
actually influences whom. This approach was not pursued further because of the
large number of random effects present, which is acknowledged as a limitation of
the procedure.
2.4.3. Influence of individual messages
While the method in the previous section directly examines the people, the focus
can shift to the messages themselves and their effects on the recipients. The
Influence metric compares the sent message with the next messages sent by the
receiver within a defined, arbitrary period r. The prior message exchange between
the sender and receiver is not considered at this stage.
When viewing a message d from person A to person B the following occurs:
All messages $f_1$ to $f_n$ sent in the next few days by the receiver of the
message d are analyzed and compared with d. The difference in time $\Delta t$
between d and a subsequent message f determines how strongly that message is
incorporated in the calculation, according to a linear scale defined as the
temporal scaling factor $z(d, f)$, where r is a period of 4 days.
Both for the message d and for each message $f_1$ to $f_n$, the TF-IDF value
$w_{t,d}$ is now calculated for all terms t, where the terms TF and IDF refer to
Term Frequency and Inverse Document Frequency, which are calculated over all
sent messages in the network. The result is stored as a vector
$v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$.
This vector allows the similarity calculation according to the Vector Space Model
[19] as the cosine of the angle between two messages:
$$\mathrm{sim}(d, f) = \frac{v_d \cdot v_f}{\lVert v_d \rVert \, \lVert v_f \rVert}$$
The influence $E(d)$ of the message is now calculated as the sum of the
similarities between d and the subsequent messages $f_n$, each weighted with the
temporal scaling factor $z(d, f_n)$:
$$E(d) = \sum_{i=1}^{n} z(d, f_i) \, \mathrm{sim}(d, f_i)$$
The overall impact of person A is the sum of the influence values of all of
their sent messages. The effort to calculate this depends on the number of
messages sent by person A, and the average activity of the recipients of such
messages.
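The calculation described in this section can be sketched in a few lines of Python. Several details are assumptions made for illustration: the linear time factor is taken as 1 - Δt/r (clipped at 0), the TF-IDF weight uses the common form tf · log(N/df), and the names time_factor, tfidf, cosine and message_influence as well as the token lists and document frequencies in the usage example are hypothetical.

```python
import math
from collections import Counter

R_DAYS = 4.0  # period r from the text


def time_factor(delta_days: float, r: float = R_DAYS) -> float:
    """Assumed linear scale: 1 for an immediate reply, 0 once r days have passed."""
    if delta_days < 0:
        return 0.0
    return max(0.0, 1.0 - delta_days / r)


def tfidf(tokens, doc_freq, n_docs):
    """Assumed weighting: term frequency times log(N / document frequency)."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t])
            for t, c in tf.items() if doc_freq.get(t)}


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def message_influence(d_tokens, d_day, replies, doc_freq, n_docs):
    """E(d): time-weighted similarity of d to later messages of the receiver.

    replies is a list of (tokens, day) pairs sent by the receiver after d.
    """
    v_d = tfidf(d_tokens, doc_freq, n_docs)
    return sum(time_factor(f_day - d_day)
               * cosine(v_d, tfidf(f_tokens, doc_freq, n_docs))
               for f_tokens, f_day in replies)


# Hypothetical mini corpus of 4 messages with assumed document frequencies.
doc_freq = {"i": 2, "we": 1, "have": 4, "a": 4, "problem": 2}
d = ["i", "have", "a", "problem"]
f = ["we", "have", "a", "problem"]
print(message_influence(d, 0, [(f, 1)], doc_freq, 4))
```

Summing message_influence over all messages sent by one person then gives that person's overall impact as described above.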
The problem with this system is that we are unable to verify this assessment with
the persons being studied. It is entirely possible that the recipients of a
message sent by A had already been using its words for some time prior to A’s
original message in our data set.
There is also the risk that two people who often talk to one another within the
network use a very complex language or vocabulary. In this case, a
few people use a specialized vocabulary which is very different from that of the
rest of the network.
2.4.4. Influence - combination of: similarity, relevance and time delay
The system described in the previous section already works very well in many cases.
However, the complexity of text⁵ plays a primary role when making a
comparison between the transmitted message and the subsequent messages of
the recipient. Every person has a unique use of language, which will also
change over time. This matters especially for terms such as a person’s name, which
may occur in the greeting line of every message. There needs to be a
system which checks for the previous use of terms.
The basic principle of Section 2.4.3 remains in place, but the system also checks whether
the concepts introduced are in fact new for the recipients or whether recipients
have already sent messages with similar content. In addition, it may be noted
that certain terms may not appear until later in the network, and thus
calculating the TF-IDF values from all messages of the network is not ideal.
Instead, the calculation falls back on the known record of Twitter ($D_{Twitter}$), thanks to which the
frequency of individual words is known. As mentioned in Section 2.3.1, the value
$\log[1 / p(w_k)]$ is just the IDF value within $D_{Twitter}$, and can
therefore also be denoted $IDF_{Twitter}$ instead of Complexity. The value of $w_{t,d}$ has therefore
been changed and is no longer dependent on the context.
5
See section 2.3.1 Metrics for the analysis of text
As an alternative without $D_{Twitter}$, the IDF value from the documents around the
time of d could also be used, but this may be unfavorable if not much text is
present or if a period of time is chosen in which the analyzed term was overly
important. Ideally, each of the IDF term values would be calculated from the
largest possible number of documents from messages of similar communication
networks.
The crucial change in the formula, however, is the inclusion of the relative
importance of the terms in the language of the recipient. If the recipient has
previously used all the terms in a message d very often, then obviously there is
no influence here. This factor importance(d, f) determines whether the similarity
between two messages is relevant.
When d is now compared with a subsequent message f, the individual influence is
calculated as the product of the time factor, the similarity of the messages and
the relevance of this similarity. This results in the final definition of the
"influence" of a single message as follows, with the time factor $z$, the
similarity $\mathrm{sim}$ and the relevance $\mathrm{importance}$:
$$E(d) = \sum_{i=1}^{n} z(d, f_i) \cdot \mathrm{sim}(d, f_i) \cdot \mathrm{importance}(d, f_i)$$
In summary, the Influence of a message is dependent on three factors: similarity
to the subsequent messages sent by the receiver, relevance of this similarity and
the difference in time between receipt and response. The combination of these
three criteria makes it possible, regardless of the type of network, to
determine the extent to which a message has caused changes in the behavior of
the recipient.
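The combination of the three factors can be illustrated as follows. This Python sketch is not the formula of this work: the relevance factor below (the share of the terms common to d and f that the recipient had not used before) is purely a hypothetical stand-in for importance(d, f), and the time factor and similarity are passed in as precomputed numbers.

```python
def importance(d_terms, f_terms, prior_terms):
    """Hypothetical relevance factor.

    Share of the terms shared by d and f that the recipient had NOT used
    before receiving d. It is 0 when every shared term was already in use.
    """
    shared = set(d_terms) & set(f_terms)
    if not shared:
        return 0.0
    return len(shared - set(prior_terms)) / len(shared)


def message_influence(d_terms, prior_terms, replies):
    """Sum of time factor * similarity * relevance over all subsequent messages.

    replies is a list of (f_terms, time_factor, similarity) tuples, where the
    last two values are assumed to be computed as in Section 2.4.3.
    """
    return sum(z * sim * importance(d_terms, f_terms, prior_terms)
               for f_terms, z, sim in replies)


d_terms = {"have", "problem"}
reply = ({"we", "have", "problem"}, 0.75, 0.4)
# The recipient had used "have" before, so only "problem" counts as new.
print(message_influence(d_terms, {"have"}, [reply]))
```

If the recipient had already used every shared term, the relevance factor and therefore the influence of the message drop to zero, which matches the behavior described in the text.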
2.4.5. Example
The following simplified example illustrates the procedure. This network consists
of four members of a team, called A, B, C and D, as well as an expert E. Since
everyone communicates with everyone, the centrality values for all nodes
are exactly the same and therefore offer no conclusion as to who in this network
is more important. Using text analysis, a new value, “Influence”, is now
calculated to ascertain who is more important at which point in the message
exchange over time.
For this example, assume that the words "problem" and "solution" are generally
rather rarely occurring words. The time between each message is identical in each
case and there are no more messages on the network. In the visualization, the
more influential people are marked with a correspondingly darker blue color.
1 Person A asks the project members for help. The text that A sends to
the project members B, C, and D is: "I have a problem."
2 Person B decides to ask the expert E for help, with the e-mail "We have a
problem."
3 The message from A to B has thus influenced B, as the text "a problem"
exists in both emails. Since B was not previously active in the network, these are
new terms for him, and even if both occur frequently in the network,
"problem" has a high complexity. Person A has clearly influenced the
network at this point in time.
4 Expert E responds to B with the mail "I have a solution to the problem."
5 The message from B to E has thus gained influence, since E responded and
mentioned the word "problem". B has now become more important in the
network.
6 B then writes a message to A, C and D. The content is "E has a solution."
7 The message from E to B is considerably influential, since B has just sent
three persons a message with the rare word "solution" which has been
learned from the last message from E. Person E has thus become the
most important because his message had the greatest influence on
communication at this point in time.
8 Over the next few days, all project members use the term "solution" more
frequently in their internal communication.
9 As the people A, C and D have received the term “solution” from B, B
is now credited as the most influential person in the network.
10 In the short run, each use of the word "solution" in the next few days
increases the influence of other messages that have used the term
before. However, in the long run, as time goes by from days to weeks, there
is less influence because of the longer time factor and because
all have now already used this term.
2.4.6. Tracing Innovation through text over time
The system described analyzes exchanges of text within a network of people
and initially identifies the influential words, or keywords, and who uses them over
time. In the example of Section 2.4.5, this would be the word "solution" and to a
lesser extent also the word "problem." These two words generated the greatest
influence and thus are interesting for further analysis.
To find the really important words, it is necessary to determine the individual
influence I of a single term t. For this purpose, the first step is to create a list of all
influential words occurring in all messages. The individual importance
values are then composed of the summed product of the IDF values, where N is the
total number of messages and K is the number of messages sent by the receiver
in the given time window of 4 days after the receipt of a message.
Using a visualization of the words with the highest importance value, trends can
be observed. As might be expected, in the network of the example of Section
2.4.5, the words "solution" and "problem" result as the most important. It makes
sense to use these words for further analysis, for example to discover
who the typical early adopters are according to the definition of Everett Rogers
[20], or to follow the development of the importance of individual words over time.
3. Implementation and Evaluation
The system theoretically described in Chapter 2 has been implemented in the
Condor 3 software. Users can test the system on any communication network
which can be analyzed by Condor 3. This chapter describes the basic Condor
operations and how the “influence” metric has been integrated into the software,
along with the results of four practical test applications. These consist of two
subnetworks of Twitter and two e-mail networks. These four cases provide a
realistic description and evaluation of how the “Influence” measure can be used
productively.
3.1. Implementation in Condor
While some of the functionality described in this section was already
implemented in Condor 3, other parts had to be developed from scratch. This section
describes the basic capabilities of Condor and their application. In
addition, all new additions to the analysis of the text are described in detail,
together with their integration into the software.
3.1.1. Functionality of Condor
Condor 3 is a software tool for the analysis of social networks. It includes several
"Fetchers", which are used to collect and store data from a variety of Internet
sources, as well as a "Content Processor," which can prepare and enrich the
data for analysis and visualizations.
The current Condor version 3.0.a8 has a Fetcher for email, Wikipedia, Twitter,
Facebook and the Web. The collected data is stored in a common format in a
MySQL database. The saved data identifies actors in the network and their
connections, along with selected node and edge attributes. The actors appear in
each case as nodes on a graph and the links represent the edges.
With the help of content processing, additional attributes can be computed and
used to supplement the original collected data. For example, the Natural
Language Processing algorithm can be used to analyze the text of a message,
calculate its sentiment and store this value as an attribute on the
corresponding message’s edge.
Condor 3 has a choice of three views for visualization: a static view, which gives
an overview of the entire network; a dynamic view, which likewise shows the
network but takes into account the time dimension; and a word cloud view
of the corresponding text content used within the network. [21]
Figure 7 Screenshot of the user interface of Condor 3
Figure 7 shows a typical screenshot of the Condor 3 user interface. On the left
side appears the name of an open dataset, signified by a green check mark; in
the middle, there appears a static view of a network with a set of graphic controls
above it to manipulate the graph; and on the right side, there is a panel of
options to change the appearance of the nodes and edges. The Fetcher and
analysis steps are accessible via the menu bar. The menu consists of seven
items: File, Fetch, Process dataset, View, Analyze, Export and Help, as well as
numerous submenus for processing.
A typical use of Condor is to first create a new database and then create a new
dataset and collect data using a selected fetcher, or merge two or more existing
datasets. As soon as the data is loaded, it is available to be viewed, processed,
filtered or new data values computed. The View menu is used to display and
analyze one or more datasets. Depending upon the user’s needs, further
processing steps can be performed, or key network metrics, as well as the
dataset itself, can be exported using the Export menu.
3.1.2. Tools for Analysis of the network
Under the menu item "Process dataset" are the options for editing the
selected dataset. A process wizard assists the user in completing the
series of forms needed for a desired task or menu selection. The
following “Process dataset” submenu items can remove actors and edges from
the dataset or merge them according to various filtering criteria:
• Graph Pruning
• Node Merging
• Sample
• Remove Specific Actors
• Actor filtering by properties
Three remaining menu items retain the existing actors and annotate these with
new values calculated using methods of textual analysis. Required for these
three text calculations is the presence of readable text, which is stored in a
dataset’s edges as the messages between two or more actors.
• Language annotation
• Calculate sentiment
• Calculate influence
In this work the latter two options are very relevant. The module for sentiment
calculation was considerably expanded and separated from the language
recognition or annotation module and was completely rewritten to incorporate the
calculation of the new Influence measure.
Language Recognition and Annotation
The step "Save annotation" can be applied to any property of nodes and edges
that is stored as text in order to detect the language in use. On Twitter, for
example, this could be the self-description of a person or, for each Tweet,
the actual message. The identified language is stored
as a new field with the name language[<field>]. For example, if the content of
the tweets is selected for language annotation, a new field with the
name language[content] is inserted, which stores the short form of the
recognized language according to the ISO 639-1 standard [22].
For language recognition and annotation, the Java Library "Language Detection"
by Shuyo Nakatani is used, which can recognize 53 different languages. [23]
However, messages or tweets on Twitter, due to their brevity, have a slightly
lower recognition rate than longer texts, and thus do not always reach the 99 %
detection rate which, according to Nakatani, is theoretically possible.
Sentiment Calculation
Calculating the sentiment assumes that the language of the text to be analyzed is
known, and therefore the step of the language recognition of the text is carried
out in each case automatically, if the data is not annotated accordingly.
Note: some text may be ignored because it is not in the desired language in
order to refine the results.
Figure 8 The Sentiment mapper default settings automatically use the most likely fields for
analysis.
Figure 8 shows the wizard for the calculation of Sentiments with the default set of
options selected:
Language of the dataset fields:
In addition to the automatic language detection, users have the option to select:
German, English, Spanish and French. If a language is selected, the wizard
ignores any text that does not correspond to the selected language.
Edge / Node Property to analyze :
Here you can choose the field that is to be used for text analysis. If a field is
present, that contains the word "content," then it is automatically selected.
Activate chunking :
If this box is checked, then the OpenNLP module is used to "chunk" or join
two or more words of the text together. For example, the occurrence of first and last
names would be joined or chunked into a single name. This only has an effect on
the visualization in the Word Cloud View.
Use OpenNLP :
If this box is checked, the separation of the individual words and sentences in a text
is performed using the OpenNLP modules. Alternatively, the sentences can also be
split using simpler regex operations.
Add Complexity / Emotionality :
The two values complexity and emotionality can be added in this step when this
option is checked.
Execution of Sentiment Calculation
First, all existing text is preprocessed. This step
removes all existing HTML code in emails, selects the most recent
message and filters out blank messages. In the next step, the wizard checks
whether language information is already available and, if not, adds it.
This is followed by the calculation of sentiments, as well as complexity and
emotionality, if the user selected this option. The exact calculation of sentiment in
Condor is described in the documentation of another project [24] and is therefore
not discussed here.
For performance reasons, multi-threading is used, where each thread analyzes a
portion of the data. A crucial step, which also increases the performance
significantly, especially in e-mail networks, applies when multiple edges
have the same text content, such as when an e-mail has multiple
recipients. It is useful to analyze such a message once and then
copy the results to the other edges.
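The idea of analyzing identical text only once can be sketched as follows. The edge layout (a dict with a "content" key), the "result" field and the function name analyze_once_per_text are hypothetical stand-ins for Condor's internal data structures, and multi-threading is left out to keep the illustration short.

```python
from collections import defaultdict


def analyze_once_per_text(edges, analyze):
    """Run `analyze` once per distinct text and copy the result to all edges.

    edges:   list of dicts with a "content" key (hypothetical edge records)
    analyze: function mapping a text to an analysis result
    Returns the number of analysis calls that were actually needed.
    """
    by_text = defaultdict(list)
    for edge in edges:
        by_text[edge["content"]].append(edge)
    for text, same_text_edges in by_text.items():
        result = analyze(text)          # the expensive step runs only once
        for edge in same_text_edges:
            edge["result"] = result     # copied to every edge with this text
    return len(by_text)


# One e-mail to three recipients is stored as three edges with equal content.
edges = [{"content": "I have a problem."} for _ in range(3)]
edges.append({"content": "We have a problem."})
print(analyze_once_per_text(edges, len))
```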
After the wizard successfully completes its processing and analysis for each edge
and each node, the following nine new data fields are appended to the
original source data:
• Language
• Sentiment
• Emotionality
• Complexity
• Keywords
• Sentiment Words
Note: If only edges were analyzed, then all nodes will get these fields appended:
• Average Sentiment
• Average Emotionality
• Average Complexity
The Sentiment Words field is used only for visualization as a word cloud in the
Word Cloud View and includes the words used and the sentiment of the context
in which they were used. The prior version of Condor included only this field,
along with language and sentiment, while the other fields were added in the
course of this work.
It is worth noting the importance of the Keywords field, because it contains the
most used words and how often they appeared in the analyzed text. This makes it
possible to calculate the Influence measure without having to use the raw
text data again.
Influence Calculation
This module is used to calculate the influence measure according to the
description in chapter 2.4. Like the sentiment analysis, a process wizard walks
the user through the steps for the analysis of the text. The calculation of the
Influence assumes that for each edge to be processed the language annotation
already exists, as well as a list of words and their frequency.
Figure 9 The Influence calculator menu
The options for the Influence wizard are shown in Figure 9 and are partly
identical with those of the sentiment calculation. The choice of language again
allows the analysis to be limited to a designated language, and it is also necessary to
specify the field to be analyzed. In this step, two different types of text edge fields
are allowed: 1) a field with readable content; or 2) a pre-processed text in the
form of a map of the words and their frequencies in the corresponding text. The latter is
preferable since it is then not necessary to call the corresponding step of the
sentiment calculation. If a field named "keywords[*]" exists, it is selected
automatically; otherwise a field with the name "content" is selected if one exists.
The option "Create new dataset" can be checked to create a dataset that contains
only those edges in which an actor had an impact on another actor. This can be used to better
identify who is influencing whom. The name of the new dataset can be specified
manually; the proposed default is the name of the existing dataset
with the string "_Influence" appended as a suffix.
The calculation of the Influence adds five new fields to an actor’s record, which
arise from the analysis of textual data. The following fields are created:
• Total Influence
• Average Influence
• Messages Sent
• Most common words
• Most influential words
The value "Messages Sent" indicates the number of sent messages. Note:
messages with multiple recipients are counted as a single message. The
Influence $E(d)$ of a single message d is calculated exactly according to the
description in Section 2.4.4.
The values of Total Influence and Average Influence are then calculated from the
values for the outgoing edges of this actor.
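The aggregation per actor might be sketched as follows. The edge layout (sender, message id, influence value) and the function name actor_influence are hypothetical, and the text does not state how the average is formed; this sketch assumes that Average Influence is the total divided by the number of outgoing edges.

```python
from collections import defaultdict


def actor_influence(edges):
    """Aggregate per-edge influence values into the numeric actor fields.

    edges: iterable of (sender, message_id, influence) tuples (assumed layout);
           a message with several recipients appears once per recipient.
    """
    totals = defaultdict(float)
    edge_counts = defaultdict(int)
    message_ids = defaultdict(set)
    for sender, message_id, value in edges:
        totals[sender] += value
        edge_counts[sender] += 1
        message_ids[sender].add(message_id)   # multi-recipient mail counts once
    return {
        sender: {
            "Total Influence": totals[sender],
            # Assumption: averaged over outgoing edges, not over messages.
            "Average Influence": totals[sender] / edge_counts[sender],
            "Messages Sent": len(message_ids[sender]),
        }
        for sender in totals
    }


edges = [("A", "m1", 0.5), ("A", "m1", 0.0), ("A", "m1", 0.0), ("B", "m2", 0.25)]
print(actor_influence(edges)["A"])
```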
In addition to numeric values, two new textual values are added, which illustrate
the use of language by listing the 5 most common and the 5 most influential
words.
3.1.3. Visualization of a Network
Every Condor 3 network has the same fundamental format, but the properties of the
individual nodes and edges may differ depending on the network type. The
nodes can either be an e-mail address of a person, a Facebook account, a web
page, a Wikipedia article, or even a combination of all these nodes. The edges
represent a connection between two nodes, which may be an e-mail message, or
a single link between two web sites or Wikipedia articles.
This thesis deals primarily with human communication networks. It can therefore
be assumed that a node is a real person, and the edges are interpreted as
messages between people. Thus, the edges always have a text in the form of
tweets, a direct message or something similar.
Figure 10 The Fruchterman - Reingold algorithm provides a useful visualization of the graph.
For visualization of the network, there are various options regarding the visual
representation. Condor uses the Fruchterman-Reingold [25] algorithm to lay out
the graph of a dataset; in this algorithm the nodes repel each other while the
edges act like springs. As can be seen in Figure 10, this also allows, in
certain cases, the detection of clusters or teams.
Condor enables a user to visually change the size, color or shape of a node,
depending on a predetermined characteristic or variable of the actors. Thus, the
actors can be distinguished in the network map according to their influence and
important actors can be visually found much faster. In addition, nodes can be
colored according to different criteria, as shown in Figure 11. The node size in
Figure 11 depends on the number of messages sent to the respective person.
Figure 11 The same network as in the previous figure benefits from different options for
representation.
There is also the ability to display node or edge labels, which, for example,
give the name or other essential characteristics. Note: for datasets with a
large number of nodes this display option can make a graph unreadable, which
would be the case with Figure 7.
3.1.4. Detailed nodes and edges information
The calculation of sentiment and influence enables new options in the
visualization of the network. If both analyses are run, then 6 new edge fields and
8 new node fields are created, which can be viewed in the static view.
The 6 new edge fields are:
1. Language
2. Sentiment
3. Emotionality
4. Complexity
5. Keywords
6. Sentiment Words
The 8 new node fields are:
1. Total Influence
2. Average Influence
3. Messages Sent
4. Most common words
5. Most influential words
6. Average Sentiment
7. Average Emotionality
8. Average Complexity
The new edge and node fields can be clearly displayed by right-clicking on a
node or edge and selecting either "Node Details" or "Edge Details".
Examples of this can be seen in Figure 12 for Node Details and Figure 13 for
Edge Details.
Figure 12 details of a single node after analysis
Figure 13 details a single edge after analysis
More important is the analysis in the static view, either by setting labels at the
nodes or, even better, by making the node size dependent on one of the new
numerical values. This makes it possible to recognize the influential actors of the
network directly, as the example of the visualization in Figure 14 shows. The
results of this network are considered in detail in Chapter 3.3.
Figure 14 The Influential persons are readily apparent in this network.
3.1.5. Diagrams
Under the View menu, there are options to create a word cloud as well as several
diagrams to visualize the data. In addition to the existing graph that
visualizes the contribution index, five more measures based on the text
analysis have been added: Activity, Sentiment, Emotion, Complexity and
Influence.
One of the new graphs combines four values: Activity, Sentiment, Emotion, and
Complexity. These values are calculated beforehand from the text contents of
the network. If no values have been calculated, an error message appears, as
seen in Figure 15.
Figure 15 Error message appears when no text analysis has taken place.
To create the new text diagram, select "Sentiment over time" under the View
menu. By default, it displays the values of Activity and Sentiment. Users can
additionally check Emotionality and Complexity; the axes are then adjusted
automatically for the selected data.
Figure 16 shows an e-mail network over a time frame of 10 days, with
large variations in activity during the week and relatively stable
Sentiment.
Alternatively, such a diagram can also be opened through the context menu of a
single node, which makes it possible to observe the properties of the selected
actor. In this case, the value "Influence" can be visualized as well, provided
"Calculate Influence" was run beforehand.
Figure 16 Sentiment and activity represented as a graph.
Depending on the period covered by the data set, the chart automatically uses
the appropriate time unit according to the scheme in Table 2.
Minimum       Maximum       Selected unit
0 Seconds     30 Minutes    Seconds
30 Minutes    1 Day         Minutes
1 Day         10 Days       Hours
10 Days       Unlimited     Days
Table 2 Time units are dependent on the existing period.
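The scheme in Table 2 can be sketched as a simple threshold lookup. This is only an illustration and not Condor's actual code: the function name select_time_unit is hypothetical, and treating each lower bound as inclusive is an assumption, since the table does not say to which row a boundary value belongs.

```python
from datetime import timedelta

# Lower bounds taken from Table 2, checked from the largest to the smallest.
# Assumption: a period exactly on a boundary falls into the coarser unit.
UNIT_THRESHOLDS = [
    (timedelta(days=10), "days"),
    (timedelta(days=1), "hours"),
    (timedelta(minutes=30), "minutes"),
    (timedelta(0), "seconds"),
]


def select_time_unit(period):
    """Return the chart time unit for a data set spanning `period`."""
    if period < timedelta(0):
        raise ValueError("period must not be negative")
    for lower_bound, unit in UNIT_THRESHOLDS:
        if period >= lower_bound:
            return unit
    raise AssertionError("unreachable: the last lower bound is zero")


if __name__ == "__main__":
    print(select_time_unit(timedelta(hours=8)))   # an 8-hour Twitter collection
    print(select_time_unit(timedelta(days=200)))  # a 200-day e-mail archive
```

A data set spanning 8 hours would thus be charted in minutes, while one spanning 200 days would be charted in days.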
For a better representation, the data are smoothed over plus and minus q
time units. q is at least 1 and otherwise equals the number of available time
units divided by 50 (rounded down). For example, for a period of 200 days,
Activity is presented as the average number of messages per day, smoothed over
9 days, that is, 4 days before and 4 days after.
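The smoothing rule can be illustrated with a centered moving average. The sketch below is a reading of the description above, not the implementation used in Condor: the function names are hypothetical, and shortening the window at the start and end of the series is an assumption, because the text does not say how the edges are handled.

```python
def smoothing_radius(num_units):
    """q = number of time units divided by 50, rounded down, but at least 1."""
    return max(1, num_units // 50)


def smooth(values):
    """Centered moving average over q units before and q units after each point.

    Assumption: near the edges the window is simply truncated.
    """
    q = smoothing_radius(len(values))
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - q): i + q + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    q = smoothing_radius(200)
    print(q, 2 * q + 1)             # radius and window length for 200 days
    print(smooth([0.0, 3.0, 0.0]))  # short series, so q falls back to 1
```

For 200 days this gives q = 4 and a window of 2q + 1 = 9 days, which matches the example in the text.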
The diagram's data can also be exported as a CSV file and explored further in
other tools, such as Excel. The exported data are smoothed as well, so
that they match the displayed chart. An export of the unsmoothed
original data is planned for a later release of Condor and will be
accessible via the Export menu.
Word Usage
Figure 17 is an example of a Word Usage graph for a Twitter network, in which
the four most important words (SVP, Switzerland, initiative and @1zu12) are
examined in more detail.
The graph shows the frequency with which the words are used, specified as the
average number of messages per day containing each word. The table also shows
the most important facts about each word, notably the following:
• Number of mentions of the word
• Number of persons who used the word
• The person who used the word for the first time
• The person who had the highest Influence on the spread of this word
• "Early adopters", that is, the first 15% of people who used the word [20]
Figure 17 Overview of the most important words of the network and its dissemination
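The per-word facts listed above, in particular the first user and the "early adopters", can be sketched as follows. The input format (a list of timestamp and person pairs for one word), the function name and the rounding are assumptions made for illustration: the text only states that early adopters are the first 15% of people who used the word, so rounding up to at least one person is a choice of this sketch.

```python
def early_adopters(usages, percent=15):
    """Return the first `percent` of distinct people who used a word.

    usages: iterable of (timestamp, person) pairs for a single word.
    Assumption: the share is rounded up, so a used word has at least one adopter.
    """
    first_use = {}
    for timestamp, person in sorted(usages):
        # Keep only the earliest usage per person.
        first_use.setdefault(person, timestamp)
    ordered = sorted(first_use, key=first_use.get)
    count = (len(ordered) * percent + 99) // 100  # integer ceiling
    return ordered[:count]


if __name__ == "__main__":
    # Ten hypothetical people use a word one after another; "a" repeats it later.
    usages = [(t, name) for t, name in enumerate("abcdefghij")]
    usages.append((20, "a"))
    adopters = early_adopters(usages)
    print("first user:", adopters[0])
    print("early adopters:", adopters)
```

With ten people, 15% rounds up to two, so the first two users of the word are reported as early adopters.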
The diagram is particularly useful when data for a network are available over
a long period of time. Such datasets enable users to observe who the
influential people are on which topics or issues over time. The data are
smoothed to provide a sensible representation.
3.2. Test Data: Overview and Treatment
This chapter provides an overview of the test data used to evaluate the
calculation of the Influence measure and describes how the data were created.
Among other things, the test data should show whether computing Influence
makes it possible to find "important" people in a network.
The communication medium Twitter is very well suited for the collection of test
data, as it is used for various purposes and is very accessible. In this work,
the question arises: Which topics benefit most from visualizing the
influence of individual persons? A realistic application area could be
marketing, to find important people who could convince others to use a
particular product. Another potential application could be journalism, where
reporters look for a relevant informant or news source in order to get crucial
information on a story.
In both cases, a dataset could be created with a small private Fetcher for
people on Twitter, in which the connections between actors are direct
communication by means of an @username mention as well as retweets, where
present. The network of Twitter "followers" is less suitable to collect and
analyze, because it can quickly grow to an unmanageable size, and such a
relationship often does not mean that a person actually reads the tweets of
the other person.
Another example is data from e-mail networks. Data collected from project
teams has the potential to be quite useful, especially if additional
information is known, such as the performance of a team or the hierarchical
structure of departments. Another data source could be personal e-mail
networks, to show an individual user who is especially important in his or her
personal environment.
3.2.1. Twitter: Politics
In prior work by the author [8] [26], a way to analyze the behavior of Swiss
politicians in Parliament was described. This network includes the Twitter
accounts of 71 members of the federal parliament; for some of them, however,
no direct communication with other politicians is visible in the network. Both
the National Council and the Council of States are taken into account, as is
the Federal Council in the case of Alain Berset. The analysis used up to 200
tweets per person from the period 28 November 2013 to 28 January 2014.
Overview:
• Total number of actors: 1,433
• Investigated persons: 71
• Number of tweets: 9,936
3.2.2. Twitter: BMW
In order to measure the success of a brand, various methods can be employed.
Very often, sentiment analysis is used to gain an overview of the opinion of
the crowd [27], but it is often also useful to look for the people who
successfully get others talking about the brand. The question is whether the
Influence value can be helpful in this typical coolhunting task.
As an example of such an application, tweets about BMW are analyzed over a
short period. The data set was created with the normal Twitter Fetcher, which
collected tweets containing the search term BMW for 8 hours. In addition,
among the 2,887 people found in that data set, the 50 Twitter accounts whose
recent tweets contained the word "BMW" most frequently were determined.
These were in particular the official accounts of BMW, such as BMW_Deutschland
or BMWGroup. The last 20 tweets of each of these accounts were collected to
further expand and strengthen the network.
The case serves as an example of an analysis of a short time span to
gain insight into the Twitter world around the discussion of a brand.
Overview:
• Total number of actors: 2,887
• Number of tweets: 5,964
3.2.3. E-mail: COIN Seminar
In the fall semester of 2013, a course on the theme "Collaborative Innovation
Networks", COINs2013 for short, was held at several universities. It involved
students from five universities: MIT, SCAD, Aalto University, University of
Cologne and University of Bamberg, who took the course at the same time in
their local language. Project teams spanning universities and time zones were
formed and worked together for the semester. This seminar has been conducted
for several years and offers an ideal opportunity to examine e-mail networks;
such data have already been collected several times for numerous projects. [28]
A special feature of this course is that the students, including the author,
use the Condor software to analyze the e-mail communication within their
project teams. All messages are cc'd to a dummy e-mail address throughout the
course. Thus, there is a separate Gmail account in which all, or almost all, of
the messages sent among the student teams, instructors and teaching assistants
are stored, which results in an interesting communication network for
each project team.
The students involved all agreed to the use of their e-mail messages for
research and analysis purposes. The data are anonymized in this report and
have been kept confidential.
Overview:
• Participating students: 52, divided into 11 teams
• Total number of nodes: 130
• Total Posts: 1,563
• Total number of edges: 7,893
3.2.4. E-mail: Private E-mail accounts
A common use of Condor is the analysis of a personal mailbox, even if it is just
to learn to use the software. The advantage of such an analysis is that a lot of
information about the network and the actors is already known to the mailbox
owner and can easily be compared with the results for face validity.
The system was tested with multiple private mailboxes of different people in
order to determine whether the set of most important people found in the
network coincides with the subjective opinion of the account holder. In this
report, two networks are presented as examples, which are analyzed anonymously.
Overview:
• Available Actors: 2,215
• Total Posts: 1,311
• Total number of edges: 11,187
3.3. Evaluations
The networks listed in Section 3.2 were prepared and examined with Condor.
The module "Calculate Influence" was used as described in Section 3.1.2 to find
the most influential people in each network. Depending on the network,
suitable dependent variables are available to justify the relevance of the
Influence value.
An important question that must be answered in the evaluation of the data is the
following: Is the Influence measure more appropriate than the betweenness
centrality measure for identifying the "importance" of a person? The
analysis shows that the definition of Influence is very well suited in all four
applications to obtaining information about the relevant actors of the network
considered.
3.3.1. Twitter: Politics
In Switzerland, approximately one-third of the Parliament has a Twitter account.
While some members are very interactive and involve many other people in the
conversation, others are less involved in the communication network. Figure 18
shows the coarse structure of the network: all parliamentarians are colored
according to their party, and other people are represented in the network as
points. The structure of the network is not very informative; only the rough
distribution of the parties is apparent.
Figure 18 Parliamentarians in this network are colored according to their party affiliation; other
Twitter accounts are drawn as points.
The crucial question is which of the politicians are the "most important" in the
network. To answer this question, the politicians are compared across four
measures: Influence, betweenness centrality, number of followers and
activity / interactivity. The last of these comes from the researchers of the
blog Some Polis, who have explored social media and politics and used
"activity / interactivity" as an important measure. [29] This is just one of
many scales used to assess the importance of people on Twitter, besides a
person's proprietary Klout score [30] or a combination of retweet and mention
counts. [31]
Figures 19, 20 and 21 each show a section of the network with the most
important politicians according to three of these scales: number of followers,
betweenness centrality and Influence. The size of the nodes depends on the
chosen scale, and the top 20 actors are labeled by name.
Figure 19 Number of followers as a measure of the importance of politicians
Alain Berset has the most followers, as he is part of the Swiss Federal Council,
but he is relatively inactive. In contrast, Cédric Wermuth and Natalie Rickli
are much more active and have also obtained a sizeable number of followers.
Christian Levrat and Pascale Bruderer, with the fourth and fifth most followers,
differ widely in their tweeting activity: 475 and 97 tweets respectively. This
suggests that, at least in this sample, the number of followers depends mainly
on external factors, such as political prominence, and not on the actual
activity on Twitter. That influence is not necessarily associated with the
number of followers has also been reported in previous work. [32]
Figure 20 Betweenness centrality as a measure of the importance of politicians
Figure 21 Influence as a measure of the importance of politicians
Betweenness centrality shows a different picture and favors those who
interact with many other people in the network, through either retweets or
mentions of user names. The list is topped by Aline Trede, who is connected to
764 people in the network (degree centrality), which is remarkable given her
851 tweets. However, she is rather an outlier; other people with high
betweenness centrality tend to have relatively high numbers of tweets.
The third illustration shows the important people as measured by their Influence
score in the network. At first glance, this picture shows some similarities with
the visualization of the people with a high betweenness centrality. It
appears that people with high betweenness scores also have a high Influence
score. However, that is not always the case. Bernhard Guhl of the BDP
ranked 3rd on Influence, but only 23rd on betweenness.
The data can also be compared in tabular form, as summarized in Table 3, in
which the top 20 persons are ranked by their Influence score. In total, a
measurable Influence was found for 57 people. The full table can be found in
Appendix A.
Table 3 List of the 20 MPs with the highest Influence in the network, out of a
total of 57 people.
Analysis
Is this list an accurate ranking of the most influential people in the Twitter
network? And, if so, why? What other evidence would corroborate these
findings? Even though these questions are very subjective, a few examples
support the accuracy of the Influence scale.
Christian Wasserfallen (Vice President FDP) is ranked 1st on Influence and
3rd on betweenness. He has long been active on Twitter and is the most often
retweeted of all parliamentarians. This has a very positive effect on both his
Influence score and his betweenness centrality. In comparison,
Balthasar Glättli (Green Party), who ranked 10th on Influence and 6th on
betweenness, has almost three times as many tweets, but these tweets are rarely
retweeted. Kathy Riklin (CVP), who is ranked 2nd on Influence and 14th on
betweenness, is a little less active. However, she writes about a very broad
set of topics, often communicates with other parliamentarians, and is followed
on Twitter by more than 40 of them, which sets her apart as a front runner.
Bernhard Guhl (President of the BDP in Aargau), who ranks 3rd on Influence, is
low on betweenness with a rank of 23rd. He was retweeted less often, but
nevertheless has a fair amount of Influence on the network by this calculation.
He owes this to being credited with spreading the hashtag #stk2014, which
stands for the Electricity Congress 2014. Tweets with this hashtag were often
retweeted.
The table also shows that 9 of the top 10 people on the Influence metric are
also among the top 20 on the betweenness centrality list. The only exception is
Bernhard Guhl, who ranked 3rd on Influence but only 23rd on betweenness.
Another notable outlier is Mathias Reynard (SP), who ranked 47th on Influence
and 9th on betweenness. This can be explained by the fact that his tweets are
entirely in French, and most German-speaking parliamentarians rarely retweet
foreign-language tweets or communicate across the language border.
Hardly any correlation can be observed between the number of followers and
the other two values, Influence and betweenness centrality. This is also
reflected in the fact that some politicians, like Jacques Neirynck (CVP),
already have a large number of followers without ever having written on Twitter.
3.3.2. Twitter: BMW
The topology of the Twitter: BMW network immediately shows that there are two
subnetworks in which people tweet to each other, while most people have no
direct connections to others. Figure 22 shows the two subnetworks. Tweets in
the green area are written in Spanish, and that network is centered around
@BMWEspana. The larger network in the blue area includes only English tweets
and contains the account @BMW. The entire network also includes further
official BMW accounts as well as bloggers, reporters and licensed BMW dealers.
Figure 22 The Twitter:BMW network indicated two subnetworks: the smaller green Spanish
network and the larger blue English network.
People in such a sparse network have correspondingly poor chances of gaining
Influence. Nevertheless, a calculation of Influence does show a clear winner:
@Ocean_BMW. Figure 23 shows a small part of the network in which this
account is visible. The account owner is an officially licensed BMW dealer in
Plymouth (UK) called Ocean BMW.
Figure 23 The node size depends on the Influence of the people. In this area of
the network, Ocean_BMW appears to be very influential.
The analysis shows that Ocean_BMW in particular brought the terms
#plymouth and showroom into the network. This is no surprise: on the
collection date, 5 February 2014, Ocean BMW had just inaugurated a new
showroom in Plymouth (UK) and presented new BMW models. After Ocean_BMW
initially wrote about it (see footnote 6), more tweets by others on this topic
followed. This is also apparent in the word cloud view of the discussion. As
Figure 24 shows, there were a total of 152 mentions of @Ocean_BMW (shown in the
figure only as @ocean), in which the comments were usually about coming to see
the new showroom.
Figure 24 The word cloud view scales the words with their frequency of
occurrence in the data set. Green words are positive, red words are negative.
In contrast to the analysis of the politicians, the Twitter: BMW test data were
confined to a narrow time window. The value of the Influence measure therefore
lies precisely in identifying the persons who were most important during the
observed period. This is a big advantage over other methods that would take
into account the total number of followers, retweets, etc. The example thus
shows that even in networks that can be created in a relatively short time with
simple means, influential people can be identified with a few clicks.
3.3.3. E-mail: COIN Seminar
A simple structural analysis of the COIN course network can answer some
fundamental questions about interaction without using any text analysis, for
example: To what extent did students engage in cross-team collaboration? A
simple visualization using the Fruchterman-Reingold layout algorithm provides a
picture of the degree to which team members were networked together.
But if we consider only the connections themselves, important information
remains hidden. In many teams, members e-mail everything to everybody, and
consequently their centrality values are very low or identical. Figure 25 shows
an example of a project team with 6 members, with the size of each node scaled
to reflect its betweenness centrality. The structure of the project team does
not reveal much beyond the fact that each team member has communicated at least
once with every other member and that one member has a single connection to a
person outside the project team.
Footnote 6: Ocean_BMW mentioned the showroom in 6 tweets. The most successful
of these can be reached at:
https://twitter.com/Ocean_BMW/status/430997170947252224
Figure 25 A calculation of the betweenness centrality provides little new information about the
network
In contrast, a very different picture emerges when the messages themselves are
analyzed. Scaling the node size by the Influence of the nodes leads to
Figure 26. Here it is clearly visible that the people in the network were not
equivalent, as the betweenness centrality values would suggest.
Figure 26 The Influence value shows that there are clear differences between
the people of the project team.
The question now is whether the person in the middle did in fact exercise the
greatest influence on the project team and, if so, whether this also applies to
other project teams. To investigate this, a survey of the project teams was
conducted, in which each COINs participant was asked the following question:
Who in your team had the greatest influence on the result of your project?
It was pointed out that the information would be kept confidential and that
respondents could nominate themselves. The survey was conducted with
participants of the 2013 course towards the end of the course and also
with the participants of the previous year's course, some 10 months after its
completion. Table 4 shows a summary of the number of responses received.
As might be expected, response rates from participants of the previous
year's course were relatively low.
Year    Participants    Project Teams    Replies    Teams without answers
2012    52              10               12         4
2013    51              11               33         1
Table 4 Summary of responses to the survey regarding influence in the project team
In total, data on 16 project teams from 45 out of 84 people were obtained. Since
the question is highly subjective, the answers within most teams are
not unanimous. The data can be used as a comparison for the calculated
Influence values, but it must be noted that some uncertainties exist.
Nevertheless, comparing the participants' responses with the calculated
Influence scores for each project team serves as a valid quality check.
Table 5 shows the survey results for project team 7 of the 2013 course. It
lists the number of votes per person and the Influence score calculated from
the analyzed e-mail network:
            Votes    Influence    Number of Messages
Person A    -        13.41        143
Person B    2        22.21        188
Person C    -         7.89        100
Person D    -         4.77         54
Person E    1        19.23        198
Table 5 Example of the obtained data of a project team
Two people voted for Person B as the most influential person, and
one person considered Person E the most influential. The calculated
Influence in the communication network matches this result:
Person B has the highest Influence value, followed by Person E.
This analysis was repeated for the other 15 project teams, with the following
results:
• In 10 of the 16 teams the person who received the highest
Influence score also had the most votes
• In three other teams the person with the highest Influence score
received at least one vote.
• Only three teams showed no positive correlation between the
number of votes and the Influence score.
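The three categories above can be sketched as a small comparison of survey votes and Influence scores, using the values of project team 7 from Table 5. The function name, the category labels and the handling of ties (the first person with the maximum Influence is taken) are assumptions of this sketch; the thesis does not describe how the comparison was carried out in detail.

```python
def classify_team(votes, influence):
    """Compare survey votes with calculated Influence for one project team.

    votes: dict person -> number of votes received (people without votes may be missing)
    influence: dict person -> calculated Influence score
    """
    top_influence = max(influence, key=influence.get)
    most_votes = max(votes.values(), default=0)
    top_votes = votes.get(top_influence, 0)
    if most_votes > 0 and top_votes == most_votes:
        return "highest Influence also has the most votes"
    if top_votes > 0:
        return "highest Influence received at least one vote"
    return "no match between votes and Influence"


if __name__ == "__main__":
    # Values of project team 7 as listed in Table 5.
    team7_votes = {"Person B": 2, "Person E": 1}
    team7_influence = {
        "Person A": 13.41,
        "Person B": 22.21,
        "Person C": 7.89,
        "Person D": 4.77,
        "Person E": 19.23,
    }
    print(classify_team(team7_votes, team7_influence))
```

For project team 7, Person B has both the highest Influence score and the most votes, so the team falls into the first category.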
The full results are in Appendix B. Of the three teams in the last category with
no correlation, one project team did not send all of its communication to the
dummy Gmail address. In the other two cases, the project members selected a
person who was only moderately influential according to the analysis of the
e-mail communication. In these two teams, some people may have contributed
significantly to the result of their team without using e-mail as the
dominant mode of communication.
It should be noted that the teams from 2012 had better matches between
their Influence scores and the number of first-place votes. This could be
related to the higher number of e-mails for these project teams. In contrast,
the 2013 course participants communicated more via Skype and other chat
programs.
The example of these 16 project teams shows that the calculated value of
Influence often correlates very well with the subjective opinion of the people
in the communication network. It also shows that the analysis of the e-mail
communication network is not sufficient in every case to identify the most
influential people, and that the result may depend on circumstances outside of
the data collection. The Influence measure should therefore not be used, for
example, to make statements about the quality of employees or of individual
project members' contributions. Influence may, however, be a useful measure for
the analysis of communication. The Influence measure takes into account
much more information than, for example, betweenness centrality and very
often finds the most important actors in a network.
3.3.4. E-mail: Private E-mail accounts
When reviewing the topology of a private e-mail network, it is usually difficult
to get an overview. Often there is a more or less star-shaped network in which
everyone is linked to one another. Figure 27 shows an example, the e-mail
network of a student.
Figure 27 The betweenness centrality in this network has only a very limited explanatory power.
Many of the people with high betweenness centrality scores have actually sent
only a few messages, but used a very large distribution list. The node
belonging to the e-mail address "_D_E18_62_I_Studierende@mx.ds.fhnw.ch" has, for
example, the fifth highest betweenness centrality, but is obviously only a
distribution list.
The betweenness centrality value indicates the crucial links between the
existing sub-networks, but does not correspond to what the account owner would
subjectively classify as most "important." If the nodes of people who never
sent a message were removed, the result would be a star-shaped structure, and
the relevance of betweenness centrality would decline even more.
A very different picture emerges when the node size is scaled with the Influence
measure, as can be seen in Figure 28. The people who are drawn large in this
figure are mainly people from a narrow circle of colleagues, as well as the
student's faculty advisor. Thus, the people who were actually relevant to the
student's life in this communication network are identified.
Figure 28 Individuals with high Influence are more central in the network.
A very similar situation was observed in another e-mail account, that of an
FHNW employee who mainly communicated with two other employees during the
observed period.
In Figure 29, these employees are colored yellow and light blue, while the
node of the account owner is shown in green.
Figure 29 Node size corresponds to the betweenness centrality of the nodes.
The person in dark blue very often sends e-mails with large distribution lists
and little content and has the second highest betweenness centrality score.
A comparison with Figure 30 shows that the dark blue person had virtually no
Influence on the network. The other two, who actively cooperated with the
person in green, are clearly important in this communication network.
Figure 30 Node size corresponds to the influence of the node.
3.4. Discussion
The previous chapter dealt mainly with classifying the actors of the networks
according to their Influence. In the process, additional node and edge metrics
were computed. This chapter describes patterns in the data that can be
found using the statistical technique of analysis of variance.
3.4.1. Properties of influential actors
Are there common properties of influential actors?
Yes, the critical factors can be found quickly: Activity (number of messages)
and betweenness centrality both correlate significantly with Influence, based on
an analysis of variance at the 1% level. This means that, for example, persons
who actively communicate with multiple project teams and thereby pass on ideas
and solutions between the project teams and their members are particularly
influential.
Note: At times, even some "spammers" are very active and very central in the
network, but overall have a lesser impact on their surroundings. This can apply
to both Twitter and e-mail networks. Such "spammers" can be characterized as
often using the same words. A Twitter account that always writes about the same
issues has much less influence than one that takes up various topics, even if
both reach only people in the same network, because the effect depends on
whether the receiver has already used the words often. Someone who already
writes a lot about a particular topic can therefore hardly be influenced by a
"spammer" who uses the same terms. Likewise, being retweeted frequently on
Twitter is of little use if it is always by the same people.
3.4.2. Properties of influential messages
Without sufficient meta-information, it is practically impossible to predict
with high probability whether a message will be influential.
Nevertheless, successful messages share certain properties.
Text length is a crucial factor, especially on Twitter. Very short tweets are
often not worth retweeting, since they contain hardly any important information.
The same holds for e-mails. Very short messages often contain little
information, for example, an acceptance of an invitation to an event or just a
reference to another message with a "best regards" at the end.
Sentiment and emotion do not seem to affect the impact of a message; no
correlation was observed. High complexity seems to have a very small positive
effect, but this cannot be confirmed with the test data at the 5% level.
For e-mails, a key factor is the number of recipients of a message. The higher
the number of recipients, the lower the impact of the individual message.
This shows that it is not useful to cc many participants on a communication,
because few people pay attention to it. On Twitter, however, there is no harm
in mentioning others with an @mention, because people then feel they have
been directly addressed and are more likely to reply.
3.5. Influence Advantages and Other Applications
The new Influence metric brings several advantages for network analysis.
Because it analyzes the actual text instead of only the connections, it is much
more difficult to manipulate. Purely structural network metrics such as
betweenness centrality can easily be manipulated by creating a large number of
connections to other nodes. In contrast, to get a higher Influence score, one
has to actively communicate with others, and those other people then have to
change their behavior and include new terms in their subsequent messages. Of
course, the Influence metric can also be deliberately manipulated. For example,
two people could send each other dictionaries, but this would require
significantly more effort and "criminal energy".
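The idea that influence is earned only when receivers adopt new terms can be illustrated with a deliberately simplified toy score. This is not the Influence formula of this thesis, which is defined in an earlier chapter and weights messages in ways not reproduced here. The sketch assumes whitespace tokenization, one point per adopted word, and that a word counts as adopted when a receiver writes it for the first time after having received it; the function name and the message format are hypothetical.

```python
from collections import defaultdict


def toy_influence(messages):
    """Toy adoption score. messages: list of (time, sender, receiver, text)."""
    used = defaultdict(set)      # person -> words this person has already written
    pending = defaultdict(dict)  # person -> {word: who first sent it to them}
    score = defaultdict(float)
    for _, sender, receiver, text in sorted(messages):
        words = set(text.lower().split())
        # The sender writes some words for the first time: credit whoever
        # introduced them to the sender earlier.
        for word in words - used[sender]:
            source = pending[sender].pop(word, None)
            if source is not None:
                score[source] += 1.0
        used[sender] |= words
        # Words the receiver has never written become candidates for adoption.
        for word in words - used[receiver]:
            pending[receiver].setdefault(word, sender)
    return dict(score)


if __name__ == "__main__":
    messages = [
        (1, "anna", "ben", "new showroom opens"),
        (2, "ben", "carl", "visit the showroom"),
        (3, "carl", "anna", "showroom visit"),
    ]
    print(toy_influence(messages))
```

In this toy model, repeating words that the receiver already uses earns nothing, which mirrors the observation about "spammers" in Section 3.4.1, while the dictionary exchange mentioned above would indeed inflate both scores.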
Although the danger of manipulation remains, taking the text into account is a
good step towards a more balanced analysis of communication behavior.
Even Google no longer relies exclusively on the weighted PageRank to present
search results and benefits from text analysis. Conventional tools for analyzing
social network graphs, such as Gephi [33], unfortunately offer no text analysis
and thus leave out the content dimension of network analysis.
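The core idea of a content-based measure, that a receiver adopts terms a sender introduced, can be sketched as follows. This is a strongly simplified illustration and not the Influence calculation implemented in Condor: the function name `adoption_counts`, the whitespace tokenization, and the plain counting are assumptions made for this example, and stop words are not filtered.

```python
from collections import defaultdict

def adoption_counts(messages):
    """Count, per sender, how many new terms a receiver later adopted.

    messages: list of (sender, receiver, text) tuples in chronological order.
    A term counts as adopted when the receiver had never used it before
    receiving it and then uses it in a later message of their own.
    """
    used = defaultdict(set)      # terms each person has written so far
    pending = defaultdict(dict)  # receiver -> {term: person who introduced it}
    score = defaultdict(int)
    for sender, receiver, text in messages:
        terms = set(text.lower().split())
        # Credit whoever introduced terms this sender now uses for the first time.
        for term in terms - used[sender]:
            introducer = pending[sender].pop(term, None)
            if introducer is not None:
                score[introducer] += 1
        used[sender] |= terms
        # Terms that are new to the receiver become candidates for adoption.
        for term in terms - used[receiver]:
            pending[receiver].setdefault(term, sender)
    return dict(score)

# Hypothetical toy conversation.
messages = [
    ("anna", "ben", "have you tried condor"),
    ("ben", "anna", "yes condor works"),
    ("ben", "carl", "hello carl"),
]
print(adoption_counts(messages))
```

In this toy version, two colluding users could raise their counts by echoing each other's unusual words, which is the dictionary scenario mentioned above. Merely adding connections, however, changes nothing as long as nobody adopts new terms.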
Section 3.2 describes four specific applications of the new Influence metric,
but the possibilities for additional use cases are nowhere near exhausted.
For example, blogs have not been taken into account, although they offer a lot of
potential for interesting research. An influential blog would be one that quickly
takes up new topics and whose readers then carry these topics, ideas and
innovations to other blogs. This raises the question of whether some blogs
frequently simply copy from others.
In e-mail networks, the main obstacle to using the Influence metric is access to
people’s e-mail. However, private networks can be analyzed easily, for example
one’s own mailbox; Section 3.2.4 shows that such an analysis is likely to find
the important people in the network.
4. Conclusion
Including text analysis yields important insights in the analysis of social
networks. Calculating the influence of a single message and its direct
impact on a receiver is a useful extension and generalization of existing
approaches, which often work only for individual, predefined networks.
The biggest challenge is addressing the variety of individual network properties
that need to be taken into account in order to convert the messages into a
common schema for efficient analysis. However, this study demonstrates that
these challenges can be overcome and it is possible to trace the diffusion of new
ideas, words and concepts among users over time based on the content of their
digital communication.
A disadvantage of the method is that it is not optimized for a particular network
or for a specific language. The Influence calculation also assumes that people
have not used the identified keywords in prior communications. This assumption
may not always hold, because sufficient historical data reaching further back in
time is often lacking.
However, the four selected test cases have demonstrated that a relatively wide
range of possible applications can be covered with meaningful accuracy.
The new content-based Influence measure outperformed the common structural
network measures in identifying the influential people in a human
communication network.
5. Recommendations for Future Work
This work significantly extended the text analysis capabilities of Condor 3,
but considerable room for improvement remains. Currently, the semantics
of the text are hardly checked, and the introduction of an ontology could be useful.
It might then be possible to detect influential messages even when the receiver
does not use exactly the same words but talks about the same subject with a
related lexicon. It should also be tested whether better standardization of the
texts improves the analysis further.
To investigate the value of the Influence measure further, additional
networks such as Tumblr or similar platforms could be considered. With
enough data, it should be possible, for example, to find persons who are
particularly successful at rapidly spreading Internet memes [34]. Such
studies can help to understand how information spreads on the Internet and what
factors play a role. The findings could then possibly be used to predict political
unrest in crisis-prone regions [35].
With the extensions created in this work, Condor 3 offers a large number of
analysis tools that can be used for various purposes. It cannot map everything
that happens in a network, but the existing information is sufficient for several
analyses, which open the door for future research in network analysis and in the
field of Cool Hunting.
6. Bibliography
[1] P. Gloor and S. Cooper, Coolhunting: Chasing Down the Next Big Thing,
AMACOM, 2007.
[2] L. C. Freeman, "A Set of Measures of Centrality Based on Betweenness,"
Sociometry, Vol. 40, No. 1, pp. 33-41, 1977.
[3] L. Page, "PageRank: Bringing Order to the Web," Stanford Digital Library
Project, Stanford, 1997.
[4] P. Gloor, "Condor Core," Galaxy Advisors, 2012. [Online]. Available:
https://galaxyadvisors.com/services/condor-core.html. [Accessed on January 10,
2013].
[5] R. Diestel, Graph Theory, Berlin, New York: Springer-Verlag, 2005.
[6] G. Sabidussi, "The centrality index of a graph," Psychometrika, Vol. 31, pp.
581-603, 1966.
[7] P. Gloor and Y. Zhao, "TECFLOW - A Temporal Communication Flow
Visualizer for Social Networks Analysis," ACM CSCW Workshop on Social
Networks, 2004.
[8] L. Brönnimann, "Multilanguage sentiment analysis of Twitter data on the
example of Swiss politicians," Windisch, 2013.
[9] F. Lüscher and M. Brun, "Sentiment Analysis," University of Applied
Sciences Northwestern Switzerland, Windisch, Switzerland, 2013.
[10] A. Go, R. Bhayani and L. Huang, "Twitter Sentiment Classification using
Distant Supervision," Stanford, 2009.
[11] Apache OpenNLP Development Community, "Apache OpenNLP Developer
Documentation," 2012. [Online]. Available:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html.
[Accessed on January 10, 2013].
[12] E. Bakshy, I. Rosenn, C. Marlow and L. Adamic, "The Role of Social
Networks in Information Diffusion," in Proceedings of the 21st International
Conference on World Wide Web, ACM, 2012.
[13] X. Wei, J. Yang, L. A. Adamic and M. Rekhi, "Diffusion dynamics of games on
online social networks," in Proceedings of the 3rd Conference on Online Social
Networks, USENIX Association, 2010.
[14] "Klout," Klout Inc., 2014. [Online]. Available:
http://klout.com/corp/how-itworks. [Accessed on 4 February 2014].
[15] G. Bogdan, A. Zygmunt and S. Podgorski, "Incorporating text into Evolution
of Social Analysis," Computer Science and Information Systems (FedCSIS), pp.
931-938, 2013.
[16] J. Jang and S.-H. Myaeng, "Discovering Dedicators with Topic-based
Semantic Social Networks," in Seventh International AAAI Conference on
Weblogs and Social Media, 2013.
[17] J. Weng, E.-P. Lim, J. Jiang and Q. He, "TwitterRank: Finding
Topic-sensitive Influential Twitterers," in Proceedings of the Third ACM
International Conference on Web Search and Data Mining, ACM, 2010.
[18] R. Dawkins, The Selfish Gene, Oxford University Press, 1976.
[19] G. Salton, A. Wong and C. Yang, "A Vector Space Model for Automatic
Indexing," Communications of the ACM, Vol. 18, No. 11, pp. 613-620, 1975.
[20] E. M. Rogers, Diffusion of Innovations, Free Press of Glencoe, Macmillan
Company, 1962.
[21] L. Brönnimann, "Sentiment in Word Clouds," Windisch, 2013.
[22] IR Authorities, "ISO 639," 9 September 2013. [Online]. Available:
http://www.iso.org/iso/home/standards/language_codes.htm. [Accessed on
February 2, 2014].
[23] S. Nakatani, "Language Detection Library," Cybozu Labs, Inc., Tokyo,
Japan, 2010.
[24] L. Brönnimann, "Analysis of communication from Swiss politicians on
Twitter," Windisch, 2013.
[25] T. M. Fruchterman, "Graph Drawing by Force-Directed Placement," Software
- Practice & Experience, Vol. 21, No. 11, pp. 1129-1164, 1991.
[26] L. Brönnimann, "Swiss politicians on Twitter," August 2013. [Online].
Available: http://www.twitterpolitiker.ch. [Accessed on 4 February 201].
[27] H. Saif, Y. He and H. Alani, "Semantic sentiment analysis of Twitter," The
11th International Semantic Web Conference (ISWC 2012), 11-15 November
2012.
[28] P. Gloor, M. Paasivaara, D. Schoder and P. Willems, "Finding collaborative
innovation networks through correlating performance with social network
structure," International Journal of Product Research, Vol. 46, No. 5, pp.
1357-1371, 2008.
[29] T. Grossenbacher, R. Straumann, F. Zirin and T. Wider, "Some Polis - Social
Media and Politics," 2013. [Online]. Available: http://www.somepolis.ch/.
[Accessed on 4 February 2014].
[30] "Twitalyzer," 2013. [Online]. Available: http://www.twitalyzer.com/.
[Accessed on 4 February 2014].
[31] Edelman, "Tweetlevel," 2012. [Online]. Available:
http://tweetlevel.edelman.com/About.aspx. [Accessed on 4 February 2014].
[32] M. Cha, H. Haddadi, F. Benevenuto and K. P. Gummadi, "Measuring User
Influence in Twitter: The Million Follower Fallacy," ICWSM, Vol. 10, pp. 10-17,
2010.
[33] M. Bastian, S. Heymann and M. Jacomy, "Gephi: an open source software
for exploring and manipulating networks," in International AAAI Conference on
Weblogs and Social Media, 2009.
[34] M. Coscia, "Competition and Success in the Meme Pool: a Case Study on
Quickmeme.com," Center for International Development, Harvard Kennedy
School, 2013.
[35] P. N. Howard, A. Duffy, D. Freelon, M. Hussain, W. Mari and M. Mazaid,
"Opening Closed Regimes: What Was the Role of Social Media During the Arab
Spring?," PITPI, Seattle, 2011.
7. List of Figures
Figure 1 Illustration of a Twitter network in which all nodes are connected
through the node in the middle
Figure 2 Representation of a Twitter network with and without the actors that
have no outgoing connections
Figure 3 Randomly selected politicians are often connected to each other through
a few nodes
Figure 4 The e-mail network of a single person can have very different structures
Figure 5 In this e-mail network, project teams communicate primarily among
themselves
Figure 6 In this representation, the color shows the betweenness centrality
of the node. Dark blue nodes have the highest value, deep red nodes the lowest
Figure 7 Screenshot of the user interface of Condor
Figure 8 The default settings should automatically use the most likely field for
analysis
Figure 9 Menu to calculate the influence
Figure 10 The Fruchterman-Reingold algorithm provides a useful visualization
of the graph
Figure 11 The same network as in the previous figure benefits from different
display options
Figure 12 Details of a single node after analysis
Figure 13 Details of a single edge after analysis
Figure 14 The influential people in this network are quickly apparent
Figure 15 Error message that appears when no text analysis has taken place
Figure 16 Sentiment and activity diagram
Figure 17 Overview of the most important words of the network and their
distribution
Figure 18 The parliamentarians in this network are colored according to their
party affiliation; other Twitter accounts are drawn as points
Figure 19 Number of followers as a measure of the importance of politicians
Figure 20 Betweenness centrality as a measure of the importance of politicians
Figure 21 Influence as a measure of the importance of politicians
Figure 22 Two subnetworks can be seen in the picture: the smaller one with
Spanish tweets and the larger one with English tweets
Figure 23 The node size depends on the influence of the person. In this area of
the network, Ocean_BMW seems to be very influential
Figure 24 The word cloud view scales the words with their frequency of
occurrence in the data set: the more frequent, the bigger. Green words are
positive, red words are negative
Figure 25 A calculation of the betweenness centrality provides little new
information about the network
Figure 26 The Influence value shows that there are certain differences
between the people of the project team
Figure 27 The betweenness centrality in this network has only very limited
explanatory power
Figure 28 Individuals with high influence are located more centrally in the network
Figure 29 The node size corresponds to the betweenness centrality of the node
Figure 30 The node size corresponds to the influence of the node
8. Glossary
Betweenness Centrality
Betweenness centrality is a commonly used structural network metric that
describes how central a node is in the network topology. It measures how often a
node lies on the shortest paths between all other pairs of nodes in the network.
The exact calculation can be found in chapter 2.2.1.
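As an illustration of this definition, the following sketch computes unnormalized betweenness centrality for a small undirected, unweighted graph using the well-known algorithm by Brandes. It is a generic textbook version, not the code used in Condor; the function name and the dictionary-based graph representation are assumptions made for this example.

```python
from collections import deque

def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph.

    graph: dict mapping each node to the set of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:              # w is seen for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair of endpoints was counted twice.
    return {v: c / 2 for v, c in score.items()}

# Star network: every shortest path between two leaves runs through the hub.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(betweenness(star))
```

In the star example the hub lies on the shortest path of all three leaf pairs, while the leaves lie on none.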
Cool Hunting
Cool Hunting is often referred to as trend hunting and consists of the
professional search for new trends and trendsetters. So-called coolhunters
investigate, among other things, current youth culture and provide their
insights and predictions about further developments to companies, which try
to gain a market advantage from them.
Hashtag
Tweets are often tagged with so-called hashtags, which consist of a pound
sign (#) followed by a term or an abbreviation. The hashtag #1zu12, for
example, indicates that a tweet is about the 1:12 initiative, which might
not have been immediately obvious from the rest of the tweet.
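A minimal sketch of how hashtags could be extracted from a tweet is shown below. The regular expression and the function name are assumptions for illustration; Twitter's own rules for what counts as a hashtag are more detailed.

```python
import re

def extract_hashtags(tweet):
    """Return the hashtags of a tweet without the leading '#'."""
    return re.findall(r"#(\w+)", tweet)

# Hypothetical example tweet.
print(extract_hashtags("Vote on Sunday #1zu12 #CHvote"))
```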
Metrics
In the context of this work, a metric is a measurement system that
measures quantifiable units according to a predetermined scale. Examples of
metrics are the number of sent tweets or the number of followers.
Natural Language Processing
Natural Language Processing (NLP for short) can be used to analyze text written
by people. NLP is an umbrella term for algorithms and methods for text
analysis, for example the automatic translation of texts or the determination of
the language of a text. Sentiment analysis is also a subcategory of Natural
Language Processing.
Retweet
The further distribution of a tweet by another person on Twitter is called a
retweet. Such a tweet is marked at the start of the message with the letters RT,
followed by the user name of the original sender.
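The RT convention can be recognized with a simple pattern, as in the following sketch. The exact format assumed here ("RT @username" with an optional colon) and the function name are assumptions for illustration.

```python
import re

# Assumed format: "RT @username: original text" (the colon is optional).
RETWEET = re.compile(r"^RT @(\w+):?\s*(.*)$", re.DOTALL)

def parse_retweet(text):
    """Return (original_sender, original_text) for a retweet, else None."""
    match = RETWEET.match(text)
    return (match.group(1), match.group(2)) if match else None

# Hypothetical example messages.
print(parse_retweet("RT @alice: new ideas spread fast"))
print(parse_retweet("new ideas spread fast"))
```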
Sentiment
The sentiment classifies the attitude expressed by the content of a sentence. A
text can be, for example, positive, sad, happy, angry or neutral. In this work, a
distinction is made between positive, negative and neutral sentiment.
Sentiment analysis
Sentiment analysis attempts to determine the sentiment of a text. The
polarity of a text is determined with machine-based methods, for which
different algorithms can be used.
Stopword
Words that occur very often in a given language, and thus provide little
information, are called stop words. In English, examples are the article "the"
or the word "and".
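Stop word removal can be sketched in a few lines. The stop word list below is a tiny hand-picked set for illustration only; real lists are much longer and language-specific, and the function name is hypothetical.

```python
# Tiny illustrative stop word list (an assumption, not a complete list).
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The diffusion of ideas and the network"))
```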
Tweet
A single message on Twitter is called a tweet; it has a maximum of 140
characters.
Tweeting
The process of writing a tweet and publishing it on Twitter is called tweeting.
Word Cloud
To illustrate the most important concepts in a text or a collection of texts, a word
cloud is often used. This is an image consisting of the most frequent words in a
text. The more frequently the word appears, the larger the word is displayed in
the word cloud.
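The scaling of word size with frequency can be sketched as follows. The linear mapping, the size range and the function name are assumptions for illustration; actual word cloud tools may scale differently.

```python
from collections import Counter

def font_sizes(words, min_size=10, max_size=40):
    """Map each word to a font size that grows linearly with its frequency."""
    counts = Counter(words)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = max(high - low, 1)  # avoid division by zero when all counts are equal
    return {
        word: min_size + (count - low) * (max_size - min_size) / span
        for word, count in counts.items()
    }

# Hypothetical word list, e.g. after stop word removal.
words = ["idea", "idea", "idea", "network", "network", "trend"]
print(font_sizes(words))
```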
9. Honesty Statement
I hereby declare that this project is the result of my own independent
work. All sources used are given and citations are marked as such.
Date / Place
Lucas Brönnimann, MSE student
10. APPENDIX A
Here is the complete list of parliamentarians with a Twitter account, according to
the analysis of section 3.3.3
Table columns: Person, Influence, Betweenness, Followers, Rank (Influence),
Rank (Betweenness), Rank (Followers)
11. APPENDIX B
Here the students of COIN 2013 are listed with the complete evaluation. The
evaluation can be found in chapter 3.2.3
Table columns: Name, Answers, Influence, Number of Messages
APPENDIX B (continued)
The following table refers to the analysis of COIN 2012.
Table columns: Name, Answers, Influence, Number of Messages