Exploring Area-Specific Microblogging Social Networks
Ece Aksu Degirmencioglu
Bogazici University
34342 Bebek
Istanbul, Turkey
eceaksu@gmail.com
Suzan Uskudarli
Bogazici University
34342 Bebek
Istanbul, Turkey
uskudarli@cmpe.boun.edu.tr
ABSTRACT
Social networks can be used to find people who share similar
interests or people who have knowledge in a specific domain. One
method to find people is based on search by specific information
about people or search by specific keywords they use. This
method is limited to the explicit information provided by people.
The other method is based on ranking people by their popularity
to suggest a list of popular people for a given interest area.
However, this method hides people who provide valuable content
but are not so popular in the network. In this paper, we examine what
people contribute rather than what they declare about themselves.
The idea is that their value depends on what they contribute. We
propose a method for identifying interest areas of people without
relying on the explicit information declared by them. Moreover,
by applying our model, we identify interest-specific microblogger
communities and extract sets of relevant words related to different
interest areas.
Keywords
Interest Networks, User suggestion, Twitter, keyword co-occurrence, social network analysis, community extraction
1. INTRODUCTION
Social applications on the Internet today allow people to
interactively share their knowledge and resources such as video,
text and images; to collaboratively perform tasks such as updating
documents; to find friends; and to communicate. The popularity of
these applications in online social networks increased after the
advent of Web 2.0 [1] technologies due to user generated content.
Among the different types of social applications, microblogging
environments became the fastest growing type with the
introduction of Twitter in 2007. With over 14 million users,
Twitter’s growth was reported as 1392 percent in 2009.
Microblogging provides a very simple, short but efficient and
quick way of spreading and retrieving information. It allows users
to expose their ideas, feelings, interests, knowledge and expertise
by means of short text messages. Users also interact through
microblogs. To express the desired content in a space efficient
manner, various space conserving conventions and notations have
emerged. For example, in Twitter, hashtags are tokens that start
with a hash symbol (#) prefix. They are used to tag the microblogs
(tweets) they occur in. A microblogger’s contributions:
- may relate to numerous topics
- are thoughts fragmented across many microblogs
- may duplicate somebody else’s content
- may contain references to other users, external links, tags etc.
The nature of microblogs causes two major problems. First, it is
very difficult to distinguish valuable content among all the
contributions. Second, it is hard to find users worth following,
i.e. users who contribute actively and contribute valuable
information.
Search tools for content and users are available for microblogging
environments and specifically for Twitter. These tools follow one
of two approaches. The first is keyword matching: people search
for specific keywords to find information which contains them.
Keyword matching is used to find either content [2] or users [3].
When content is searched by keywords, thousands of contributions
from many different users appear in the result set, which makes it
very hard to filter out the irrelevant content. When users are
searched by keywords, the result set depends on the explicit
information declared by the users themselves. When people’s
declarations and their contributions are not aligned, such queries
return misleading results. The other approach is popularity based
ranking. With this approach, the discovery of users who share
relevant and valuable information is based on the popularity of
users, measured by quantitative variables such as the number of
followers or contributions. Search tools of this kind leave the
less popular but more valuable users hidden among millions of
other users.
In this study, we focus on finding users who are interested in a
specific area by processing their microblog contributions. We also
focus on finding the relations between users who build up a
community around similar interests. This work aims to:
(1) identify a microblogger’s interest given their contributions
(2) identify a community of interest given a microblogger
(3) examine the social network properties of microblogger
communities of interest
It proposes, first, to process a collection of microblog
contributions and reduce them to a set of keywords representing
the nature of their content and, second, to identify a community
of interest based on a user.
In the following sections, the proposed model is described in
detail and evaluated based on the data collected from Twitter.
2. BACKGROUND
2.1 Twitter
Twitter [4] is a free social networking and microblogging service
which allows users to publish, share and retrieve content known
as tweets. Twitter also supports video and image formats in
addition to text.
Each tweet is limited to 140 characters. Due to this shortness,
tweets contain abbreviations, short URLs, user names and special
notations such as words prefixed with ‘#’ or ‘@’ characters. The
real content is usually given in links external to Twitter. In
some cases, the content may also be fragmented over several
tweets. Users follow other users’ tweets, which means that the
contributions of all the users they follow are listed in a single
stream.
Other than publishing tweets, users can follow other users in
Twitter. Users are in the follower role when they follow another
user and are listed in the followers list of the user they follow.
The users who are being followed are called friends. Tweets are
broadcast in one direction, which means that the followers of a
user are able to see the tweets published by that user. However,
any other user who comes across these tweets in a search is also
able to see them, as long as the security and privacy settings are
set to public instead of private.
A reply is a special message sent from one user to another. It is
distinguished from a normal tweet by the at sign (@) prefix of the
user name. If a tweet begins with @username, it is a reply. If the
tweet contains @username but not at its beginning, it is
considered a mention. Twitter displays the tweets which contain
@username on that user’s home page. Users do not need to follow
other users in order to see the replies or mentions addressed to
them. Replies and mentions are also broadcast publicly to users
who are not their intended recipients. Twitter additionally allows
users to send private messages, named direct messages, from one
person to another. Direct messages can only be sent to followers.
They are out of our scope in this study since they are not
publicly retrievable.
A retweet is a re-post of something posted by another user,
usually preceded with "RT" and "@username" to refer to the
original poster. Retweets are used very frequently in Twitter and
have a dramatic influence on users [5].
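The reply, mention and retweet conventions described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_tweet` and the regular expression for user names are assumptions made here, and Twitter itself may apply different rules.

```python
import re

# Assumed user name pattern: "@" followed by letters, digits or underscores.
USERNAME = re.compile(r"@\w+")

def classify_tweet(text):
    """Label a tweet as 'retweet', 'reply', 'mention' or 'plain'."""
    stripped = text.strip()
    # Retweet convention: "RT" followed by "@username".
    if re.match(r"RT\s*@\w+", stripped):
        return "retweet"
    # A tweet that begins with @username is a reply.
    if USERNAME.match(stripped):
        return "reply"
    # @username anywhere else in the tweet is a mention.
    if USERNAME.search(stripped):
        return "mention"
    return "plain"
```

As the paper notes later, nothing forces users to follow these conventions, so the labels are best read as heuristics.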
Tags can be considered a bottom-up approach to classification,
compared to taxonomies, which are defined hierarchically by
experts for a limited set of items with a top-down approach. In a
taxonomy, there is one way to classify each item. With tags,
however, an item can be classified in many different ways since
tags have a flat structure [6].
A special type of tag is the hashtag, which is used in
microblogging systems such as Twitter and Tumblr [7]. Hashtags are
words or phrases prefixed with a hash sign (#), with multiple
words concatenated. Throughout this study we refer to hashtags and
tags interchangeably.
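As a small illustration of the hashtag notation, the sketch below extracts hashtags from a tweet. Lower-casing the tags and the pattern `#(\w+)` are assumptions made for this example, not rules stated by Twitter or by this paper.

```python
import re

# Assumed hashtag pattern: "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Return the hashtags of a tweet, lower-cased and without the '#' prefix."""
    return [tag.lower() for tag in HASHTAG.findall(text)]
```

Lower-casing lets "#MDA" and "#mda" count as the same tag.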
Users in Twitter are represented with a unique screen name. They
optionally enter their profile information such as name, biographic
information, location and web site address.
In order to organize the users they follow, users create lists in
Twitter. By means of lists, people are grouped by specific
subjects or interest areas. However, being included in a list does
not always mean that a user publishes valuable content on the
subject matter of the list. On the other hand, since lists are
optional, users who are not included in any list may be producing
more valuable content than the users we retrieve via lists. In our
study, we keep all explicit categorizations given by the users out
of our scope and focus on the content to explore the communities
or user groups that come together around a specific area of
interest. For this reason, we do not use lists as a parameter to
discover the relationships based on interests.
Copyright is held by the authors.
Web Science Conf. 2010, April 26-27, 2010, Raleigh, NC, USA.
3. RELATED WORK
Social networks can be used to find people who share similar
interests or people who have knowledge in a specific domain.
Using social networks to share knowledge is a very efficient way
of reaching information or simply finding answers to questions.
There are two types of methods to discover people in
microblogging environments. One method is based on search by
specific information about people or search by specific keywords
they use [2][17]. It is based on keyword search and limited to
information explicitly declared by the users such as name,
location, marital status, interests etc. Since users often do not
declare their interest areas and other information which may help
in finding them, locating those who are of interest is usually a
time consuming task. The other method is based on ranking people
by their popularity to suggest a list of popular people for a given
interest area [3][8][16]. These people suggestion tools rank
people by their popularity, measured by the number of other
people following them. However, popularity based methods make
the popular people more popular while keeping the valuable but
less popular ones hidden. In addition, popularity based ranking
methods make it impossible to find the right people when the
goal is to interact, communicate or simply ask questions to
retrieve specific information.
4. PROPOSED MODEL
We propose a method for identifying users’ interest areas and the
communities based on these interests.
Our approach aims to identify a microblogger’s interests by
processing their contributions. It is assumed that, as long as users
actively publish microblogs aligned with their interest areas, it is
possible to identify these interest areas by processing their
contributions. In addition, it is considered that hashtags are key
elements of microblogs which help to distinguish specific areas of
interest among conversational content.
In our method, interest areas of users are identified by extracting
the hashtags they use in their microblogs. Tag clouds are
representations of all the tags assigned by a user in a form in
which the most frequent tags can be distinguished in a single
view. Tag clouds of users are enriched by associating hashtags and
non-hashtag words based on their co-occurrence in the same
microblog. Two points are important here:
- Once a word has been found to be used as a hashtag in any
microblog from any of the users in the dataset, it does not need
to appear in hashtag form in other microblogs for its
co-occurrence relations to be counted.
- Enriching the tag cloud of a user is based on all microblogs
from all users, which means that, while investigating the
co-occurrence relations, all hashtags and non-hashtag words from
all the users in the dataset are considered.
Once the hashtags are associated with the users, the users who use
common hashtags are also associated in order to extract
communities of interest.
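A minimal sketch of these ideas follows, under assumptions that the paper does not state: tokens are lower-cased, a token is a run of word characters with an optional '#' prefix, and the filtering of Section 4.1 is skipped. The function names are illustrative and are not taken from the authors' Java implementation.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"#?\w+")  # assumed token pattern

def tokenize(text):
    """Lower-cased tokens; a leading '#' is kept to mark hashtags."""
    return [tok.lower() for tok in WORD.findall(text)]

def collect_tags(tweets_by_user):
    """Every word used as a hashtag at least once by any user."""
    tags = set()
    for tweets in tweets_by_user.values():
        for text in tweets:
            tags.update(tok[1:] for tok in tokenize(text) if tok.startswith("#"))
    return tags

def tag_clouds(tweets_by_user):
    """Per-user counts of the hashtags they used (the raw tag cloud)."""
    clouds = defaultdict(Counter)
    for user, tweets in tweets_by_user.items():
        for text in tweets:
            for tok in tokenize(text):
                if tok.startswith("#"):
                    clouds[user][tok[1:]] += 1
    return dict(clouds)

def enrich(cloud, tags, tweets_by_user):
    """Non-tag words co-occurring with the user's tags in any user's microblog."""
    related = set()
    for tweets in tweets_by_user.values():
        for text in tweets:
            words = {tok.lstrip("#") for tok in tokenize(text)}
            if words & set(cloud):        # tag present, in hashtag form or not
                related |= words - tags   # keep only non-tag words
    return related
```

For example, if one user writes "Reading about #mda today" and another writes "mda and omg are related", the first user's cloud is enriched with "omg" even though the second microblog never uses the hashtag form.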
Our method consists of two steps: (1) Processing Microblogs and
(2) Network Analysis. These steps are shown in Figure 1 below,
with the sub-steps in each.
[Figure 1: flowchart of the proposed model. Pre-Processing: Input
User Name; Retrieve Tweets (from Twitter); Parse Tweets into
Words; Words Categorization (Stop Words, Links, Tags, Emotion
Words, General Words, Time Related Words, Mentions & Replies);
Punctuations Removal; Elimination. Network Generation: Retrieve
all the words used as tag at least once; Find Tag-Word
Co-occurrences; Find Co-Occurrence Freq; Generate Words Network;
Find Users-Tags Relations; Find User-Tag count; Generate Users
Words Network. Outputs: Interest Area 1 ... N and Community
1 ... N.]
Figure 1. Proposed Model
User text contributions are processed in the following manner:
- A set of users is selected, such as:
o A set of randomly selected users
o A set of users in a list
o A set of users who are already known to have conversations
- All microblogs published by these selected users in a
microblogging environment are retrieved,
- Microblogs are parsed into tokens,
- Tokens are categorized into the categories defined in
Section 4.1, Processing Microblogs,
- Tokens which are defined in the Elimination part of Section 4.1
are filtered out,
- Interest areas of users are defined by associating the users and
the hashtags they use,
- Tag clouds of users are enriched by associating hashtags with
the other words they co-occur with in the same tweet,
- Interest based user networks are generated by associating
users who use common hashtags in their microblogs.
These steps are explained in detail in the next sections.
4.1 Processing Microblogs
Microblogs contain words or phrases which are relevant keywords
for the interest areas of the user who publishes them. However,
microblogs also contain other words which are irrelevant to user
interests. Examples of such words are ‘I think’, ‘thanks’, ‘great’
etc. In addition to these, punctuation marks are also frequently
used in microblogs.
In our proposed model, we focus on discovering the words which
are candidate keywords for the interest areas of the users who
post the content. Hence, we aim to have a precise set of relevant
words which give us clues to explore interest areas and the
communities attached to them. To obtain this set, tokens are
categorized first, then irrelevant categories are filtered out as
described in the remainder of this section. The processing
algorithm performs the following steps:
Parsing: All microblogging contributions are parsed, resulting in
a bag of tokens.
Tokens Categorization: The tokens are categorized into the
following categories in this step. By means of categorization, it
is possible to eliminate the irrelevant ones in the next steps.
- Stop Words: Stop words are words that are either insignificant,
such as prepositions, articles etc., or so common that they cannot
be related to any specific interest area [9]. They vary from
language to language and system to system. Here, we refer to the
stop words defined for English. Tokens are categorized as stop
words so that they can be eliminated in the following steps of our
algorithm, since they do not contain relevant or significant
information for discovering the interest areas of users.
- Links: Microblogs are short messages. Due to the character
limitation of microblogging systems, microblogs contain links
through which users can navigate to web sites external to the
microblogging system. By means of links, details of the
information given in the microblog can be retrieved. In our
model, we categorize all the words starting with the character
sequence “http” as links.
- Hashtags: Hashtags are words in microblogs with a hash sign
“#” prefix. Due to the character limitation of microblogs,
hashtags are used to allow users to give brief information about
the content of the links, pictures or simply the text they
publish. Hashtags are also used to search for what users have
posted previously. In our model, hashtags are associated with the
users to identify their interest areas. It is assumed that if a
word is used as a hashtag in a microblog, it is more valuable than
the other words used in the microblogs in terms of relevancy to
the interest areas of users.
- Mentions & Replies: Tokens are categorized as mentions if they
have an at sign “@” prefix and are not the first token of the
microblog.
Example:
Leaving house for first time since Xmas eve. Excuse to wear
red duffle coat from sister and sparkly scarf from @evwa!
@evwa is a user who is mentioned in this tweet.
Tokens are categorized as replies if they have an at sign “@”
prefix and are the first token of the microblog.
Example:
@delta_goodrem Hope u had a Beautiful xmas Delta!
@delta_goodrem is a user who is replied to in this tweet.
In our method, all the tokens beginning with the “@” character are
categorized either as mentions or as replies. However, it cannot
be certain that such a token is really a reply or a mention, since
there is no specific rule or restriction for users on using this
notation in microblogs. It can only be said that such tokens are
user names.
In our algorithm, all the mentions and replies are eliminated
since the focus is on the keywords relevant to interest areas.
- General Words: Due to the dynamics of microblogging systems,
users commonly refer to the microblogging application itself as
well. Such common words are specified in our model and
eliminated. Other common words which are used frequently but are
not treated as stop words are also categorized as general words.
The selected words in this category are:
General Words = {“twitter”, “RT”, “RT@”, “tweet”, “free”,
“check”}
- Emotion Words: When people post microblogs, they add their
feelings as well. Tokens which express the feelings of users are
categorized as emotion words. Emotion words are eliminated in our
model.
Emotion Words = {“good”, “big”, “awesome”, “amazing”,
“fine”, “nice”, “bad”, “beautiful”, “love”, “enjoy”, “hate”}
- Time Related Words: Tokens describing the time of the events or
information given are frequently used in microblogs. These tokens
are not in our focus since they are common words which are not
specifically related to any interest area. Though there are many
time related words in the English language, only the most common
ones are defined as time related words in our model:
Time Related Words = {“today”, “tomorrow”, “now”, “2009”}
Punctuation Removal: All non-alphanumeric characters are
removed from the tokens extracted in the parsing step (e.g.
“favorite!!” becomes “favorite”). Alphanumeric characters are the
set of numbers 0 to 9 and letters A to Z. All the characters which
are not contained in this set are non-alphanumeric characters.
Punctuation removal is done after the categorization of the
tokens. The reason for this is that the categorization is based on
the special notation in microblogs which helps to identify the
types of the tokens (i.e. hashtags, mentions, links etc.).
Elimination: The last step of processing microblogs is
elimination, where tokens irrelevant to any interest area are
filtered out to obtain a concise list of words which are expected
to be more relevant to a specific interest area. This concise set
of words is used to explore the interest areas of users and the
associations between users based on these interest areas in the
next steps of the model.
The tokens in the following categories are filtered out in this
step: links, stop words, emotion words, general words, time
related words, mentions and replies. The final set contains only
tags and non-tag words.
In the following sections, we explain how interest areas of users
are identified by associating users and the hashtags they use in
their microblogs. Then we explain how other words relevant to a
hashtag are associated with it by using the method of
co-occurrence. Finally, we explain how we define associations
between users based on the common interests they have.
4.2 Network Analysis
In the second step of our model, we have three main objectives: to
discover a set of words relevant to an interest area, to discover
the interest areas of users and to generate interest based
communities.
Relations among users and their interest areas are represented in
the form of different types of network graphs. In the following
sections, these networks are explained in detail.
4.2.1 Relevant Words Network
This network aims to present the set of relevant words specific to
an interest area. The relations between words allow us:
- to discover and categorize the set of words which belong to a
specific interest area
- to navigate through related interest areas
- to enrich the tag cloud of each user so that users can be
searched by other words which are not used by them but are
relevant to their interest area
A sample representation of this type of network is shown in
Figure 2.
Figure 2. Sample Network of Relevant Words
It has two types of nodes, the sets T and W defined below. A word
may be used as a hashtag in one microblog and not as a hashtag in
another; in this case, the word is counted as an element of the
first set, T.
T = { t1, t2, t3, …, tn }: a set of hashtags
W = { w1, w2, w3, …, wm }: a set of other words which are not
hashtags
If a hashtag t in T co-occurs with a word w in W in at least one
microblog, there is an edge between the hashtag t and the word w.
The weight of the edge is the number of microblogs in which they
co-occur. Edge weights directly give us the co-occurrence
relatedness of words.
E = { (t, w) | t ∈ T ∧ w ∈ W ∧ t ≠ w }
Weight(e) = |{ m ∈ M | t ∈ m ∧ w ∈ m }| for e = (t, w), where M is
the set of all microblogs in the dataset
To illustrate how the weights are calculated, a basic example is
given here:
Microblog1 = “Reading papers about OMG’s #MDA to see what
work has been done on PIM->PSM transformation.”
Microblog2 = “MDA and OMG are related concepts”
In the first microblog, “mda” is a hashtag. After the hashtag
“mda” is identified as an element of T, all microblogs containing
“mda” either as a hashtag or as a non-hashtag word are retrieved.
For this reason, by also counting the second microblog, in which
“mda” is not a hashtag, the co-occurrence frequency of the words
“MDA” and “OMG” is two, since they co-occur in both of the
microblogs.
The set W does not contain any words which are in the set T, in
order to avoid a connection between a word used as a hashtag in
one microblog and the same word not used as a hashtag in another
microblog.
T ∩ W = {}
Based on the co-occurrence of two words in the same microblog,
we generate a network which shows us the relevancy between all
the tokens which are the output of the elimination step of
processing microblogs. Hashtags are different from non-hashtag
words since they give precise information about the content and
hence the interest areas. Instead of trying to find all
co-occurrences of all words in all contributions, hashtags are
considered as hubs and words which co-occur with hashtags are
associated with these hubs. By doing this we filter out irrelevant
words such as conversational ones and assume that the words which
co-occur with hashtags are possible candidates to be keywords for
interest areas even if they are not used as hashtags in the
microblogs.
Based on the approach explained above, the following steps are
executed to generate the relevant words network:
- Step 1: Retrieve all the words used as a hashtag at least once
- Step 2: Find all the words which co-occur at least once with the
words found in the first step
- Step 3: Find the frequency of each co-occurrence
- Step 4: Generate the words network
The relevant words network identifies the interest areas in a
given set of microblogs. Once the interests are extracted, the
next step is to associate these interest areas with the users.
In the following section, we describe in detail how the users and
their interest areas are associated.
4.2.2 Users – Interest Areas Network
This network aims to present the relations between users and their
interest areas. Users are associated with the hashtags they use at
least once in any of their microblogs.
A sample representation of this type of network is shown in
Figure 3.
[Figure 3: graph with sample user nodes Jane, Mike and John]
Figure 3. Sample Network of Users-Interest
It has two types of nodes, which are:
U = { u1, u2, u3, …, um }: a set of users who use at least one
hashtag in at least one of their microblogs
T = { t1, t2, t3, …, tn }: a set of hashtags used by any of the
users in U at least once
The assumption here is that hashtags are key elements to discover
the interest areas of users. Users who do not use any hashtags in
their microblogs cannot be associated with any specific interest
area by using our model.
If a hashtag t ∈ T occurs at least once among the microblogs
posted by the user u ∈ U, there is an edge between the hashtag t
and the user u. In order to evaluate the strength of the edges and
hence the relation between a user u and the hashtag t, we assign
weights w(eu) to the edges.
EU = { (u, t) | t ∈ T and u ∈ U }
Weight(eu) = the user-tag count of Figure 1, i.e. how often the
user u uses the hashtag t
This network is used to generate user networks based on the
hashtags they use in common. In the next section we give the
formal definition of the interest based user networks.
4.2.3 Interest Based User Networks
This network aims to present the set of users who are connected
based on an interest area. The relations between users allow us
to:
- discover communities based on specific areas of interest
- navigate through users who are connected based on an interest
A sample representation of this type of network is shown in
Figure 4.
It has one type of node, which is:
U = { u1, u2, u3, …, um }: a set of users who use at least one
hashtag in at least one of their microblogs
The set of hashtags which are used in common by at least two
users is also defined as T = { t1, t2, t3, …, tn } in this
network. However, they are not presented in the network graph as
another type of node.
The assumption here is that users who have common interests
tend to use the same hashtags. The more two users use the same
hashtags in their microblogs, the more likely they are to share
similar interest areas.
We define interest based user networks as a weighted undirected
graph. We assume that users who use a hashtag t in T at least once
are represented as u in U. We define undirected edges e in E
between two users if they use the same tag at least once. In order
to evaluate the strength of the edges and hence the relation
between two users, we assign weights w(e) to the edges. The number
of common tags two users use in their tweets gives us the value of
the weight of the edge between them.
E = { (un, um) | un ∈ U ∧ um ∈ U ∧ un ≠ um ∧ un and um use at
least one common hashtag }
Weight(e) = |T(un) ∩ T(um)| for e = (un, um), where T(u) is the
set of hashtags used by the user u
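The three networks of Section 4.2 can be sketched together as below. This is a hedged illustration rather than the authors' implementation: the tokenizer, the `ignored` parameter (standing in for the eliminated categories of Section 4.1) and the choice to count a user-tag weight once per microblog are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

TOKEN = re.compile(r"#?\w+")  # assumed token pattern

def tokens(text):
    """Set of lower-cased tokens of one microblog; '#' marks hashtags."""
    return {tok.lower() for tok in TOKEN.findall(text)}

def build_networks(tweets_by_user, ignored=frozenset()):
    """Return (word_edges, user_tag_edges, user_edges) as weight Counters."""
    all_tweets = [(u, tokens(t)) for u, ts in tweets_by_user.items() for t in ts]
    # T: every word used as a hashtag at least once by any user.
    tags = {tok[1:] for _, toks in all_tweets for tok in toks if tok.startswith("#")}

    word_edges = Counter()      # (t, w) -> number of microblogs where they co-occur
    user_tag_edges = Counter()  # (u, t) -> number of u's microblogs with hashtag t
    for user, toks in all_tweets:
        hashtags = {tok[1:] for tok in toks if tok.startswith("#")}
        words = {tok.lstrip("#") for tok in toks} - ignored
        present_tags = words & tags   # a tag counts even without the '#' prefix
        others = words - tags         # W is disjoint from T
        for t in present_tags:
            for w in others:
                word_edges[(t, w)] += 1
        for t in hashtags:
            user_tag_edges[(user, t)] += 1

    # Hashtags used by each user, then one edge per pair with common hashtags.
    tags_of = {}
    for user, t in user_tag_edges:
        tags_of.setdefault(user, set()).add(t)
    user_edges = Counter()
    for a, b in combinations(sorted(tags_of), 2):
        common = len(tags_of[a] & tags_of[b])
        if common:
            user_edges[(a, b)] = common
    return word_edges, user_tag_edges, user_edges
```

Applied to the two microblogs of the MDA and OMG example, the edge ("mda", "omg") receives weight two, matching the count given in Section 4.2.1.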
5. IMPLEMENTATION
The proposed model has been implemented on the Sun Java 2
Platform version 5 [10] using the Java programming language. We
used the Eclipse IDE [11] software development environment to
edit, compile and debug our source code. Twitter exposes its data
via its API. In order to invoke the API methods from our Java
implementation, we use an open source Java library for the Twitter
API: Twitter4j [12].
5.1 Data Set
In this study, we focus on users who publish content relevant to a
specific interest area. In the case of random user selection,
there is a high probability that we retrieve users who publish
conversational content, given the nature of Twitter, where most of
the content is conversational. Our alternative approach is to
select a set of users who are already known to be related in terms
of replies, follower or friends lists, retweets or mentions. By
selecting such related users, we can apply our method and, if it
rediscovers these relations implicitly from the content the users
publish, verify that our model is promising.
We assume that the users who have conversational relationships
have the potential to be:
- Human users – not bots or spammers
- Active users – who publish content instead of solely reading it
These two criteria are important for us since we focus on
exploring relations between human users and on users who publish
sufficient content in a specific area so that we can analyze them.
In our model, the more content a user provides, the more we can
explore their interest areas and relations in terms of these
interests.
Accordingly, we selected 50 users from the Wefollow [8] web site,
where users are categorized into specific areas and ranked by
their popularity. Then, for each of these 50 users, we ranked the
other users by how often that user replied to them, selected the
20 most replied ones and collected their tweets. The size of our
data is given in the table below:
Table 1. Data Set
# of users   # of tweets   # of all words   # of stopwords   # of words (concise list)
802          1,752,300     26,356,735       15,254,743       11,101,992
6. EVALUATION
During the evaluation process, we created interest based user
networks for each set of users by using the tweets they publish
and analyzed the basic network properties of each. We compare
these networks in terms of centrality, betweenness and degree of
their nodes below. Sample results are displayed in Table 2.
Table 2. Evaluation of Data
Network   Avr. Centrality   Avr. Degree   Avr. Closeness   Avr. Betweenness
User1     4                 0.21324       0.28028          0.02021
User2     11                0.05882       0.09635          0.00171
User3     1                 0.38462       0.43719          0.14204
We evaluate the networks based on the centrality measures of
social network analysis, which are closeness, betweenness and
degree. In addition, we extract the central nodes of each network.
In our case, higher centrality values of the nodes imply that a
single user has common interests with many of the other users
while the other users do not have common interests with each
other. By measuring these metrics we aim to:
- Compare networks based on the same criteria
- See whether the users are connected based on the content they
share (proof of our model)
- Evaluate the relation between user behaviors and their
network properties
- Find out central nodes and interest based clusters of users
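For orientation, two of the centrality measures named above can be computed as in the sketch below. The paper does not state which normalization or tool was used, so the formulas here (degree divided by n - 1, closeness as reachable nodes divided by the sum of shortest path lengths) are common textbook choices assumed for illustration; betweenness is omitted for brevity.

```python
from collections import deque

def degree_centrality(adj):
    """Degree of each node divided by the number of other nodes."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """Reachable nodes divided by the sum of shortest path lengths to them."""
    result = {}
    for source in adj:
        # Breadth-first search gives unweighted shortest path lengths.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for nb in adj[v]:
                if nb not in dist:
                    dist[nb] = dist[v] + 1
                    queue.append(nb)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result
```

In a star-shaped network, where one user shares hashtags with everyone else and the others share none among themselves, the hub scores 1.0 on both measures, which corresponds to the "single central user" case described above.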
The initial result of our evaluation is that our model finds the
relations between users based on the content shared by them,
without knowing any type of relation information between the
users. It connects the users who share similar content and
isolates the users who do not have any content in common with any
of the users in the network. A threshold can be set to isolate the
users. We set the threshold to five and therefore removed the
edges with a weight lower than five in the sample network. A
sample network for the user “cforbesoklahoma” is shown in
Figure 5.
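The thresholding step can be sketched as follows. The function name and the dictionary representation of the edges are assumptions for this illustration; only the rule itself (drop edges with a weight lower than five) comes from the text.

```python
def apply_threshold(user_edges, users, threshold=5):
    """Drop edges lighter than `threshold`; return kept edges and isolated users."""
    kept = {edge: w for edge, w in user_edges.items() if w >= threshold}
    connected = {u for edge in kept for u in edge}
    isolated = set(users) - connected
    return kept, isolated
```

A user whose every edge falls below the threshold ends up in the isolated set, which corresponds to the isolated nodes of the sample network.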
Avr.
Centers Degree
Network
User
Avr.
Avr.
in the
Closeness Betweenness center? Nodes
SunnaGunnlaugs 16
0.01754
0.03098
0.00023
Yes
20
Ivan_Herman
0.13235
0.20827
0.00832
No
18
3
Figure 5. Connected and Isolated Users
While we capture the center nodes of each network based on their
common interests, it is shown that the original users in each
network not necessarily in the center of the networks. This means
that even though we selected our data based on reply relations of a
specific user, our model generates a network where user
connections are completely based on the content they share.
Central nodes extracted from the network of user “appstoresocial”
is show in the figure below where the input user “appstoresocial”
is not one of the central nodes.
Figure 8. Central Nodes in Ivan_Herman Network
Figure 6. All nodes for “appstoresocial” user network
Figure 7. Central Nodes of The “Appstoresocial” Network
The analysis of the networks also shows us that the lower values
of betweenness, degree and closeness of the network causes the
number of central nodes to increase. Comparison of these two
types of networks is given with a sample below:
Table 3. Network Measures Comparison for Two Sample
Networks
Figure 9. Central Nodes in SunnaGunnlaugs network
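For readers who want to reproduce measures of this kind, the following is a minimal sketch of normalized degree, closeness and betweenness centrality on an unweighted, undirected network, together with their network averages. The toy star network is hypothetical, and treating the nodes with the highest closeness as "central" is an assumption made only for illustration; it is not necessarily the definition of centers used to produce Table 3.

```python
from collections import deque
from statistics import mean


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path distances."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Brandes' algorithm for unweighted, undirected graphs (normalized)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(graph)
    # Each undirected pair is counted twice, so divide by (n-1)(n-2).
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in score.items()}


# Hypothetical toy network: one hub connected to three other users.
graph = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}

degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
betweenness = betweenness_centrality(graph)

print(mean(degree.values()), mean(closeness.values()), mean(betweenness.values()))

# Assumption for illustration: central nodes are those with maximal closeness.
best = max(closeness.values())
central = sorted(v for v, c in closeness.items() if c == best)
print(central)  # ['hub']
```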
In order to see the interest areas of these central users, we
queried the words network we generated for each user. A sample
that lists the tags and the words associated with the users
“menuspony” and “kendall” is shown in Table 4. In this case we
expect the users in the network to be related to the semantic web
area, since the user “ivan_herman” is an academic who provides
content relevant to this area. Hence, when we generate our
words-users-tags network for the user “ivan_herman”, the tags he
uses that are relevant to the semantic web area are extracted
easily. In addition, since we associate tags with users and tags
with words, the overall network contains both users and words
relevant to the semantic web area. From this perspective, the
results produced by our model on our test data are as we expected.
Table 4. Sample Co-occurrence Results
Tag          Co-Occurring Word  Co-Occ. Freq.
Html5        Rdfa               78
Linkeddata   Data               55
Semanticweb  Linkeddata         53
Semanticweb  Web                52
Semanticweb  Rdfa               24
Html5        Microdata          20
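Co-occurrence frequencies of this kind can be obtained by counting, for every tag, the words that appear in the same message. The sketch below is a simplified illustration: it assumes that tags are hashtags, that co-occurrence is counted once per tweet, and that other tags in the same tweet also count as words. The tokenizer and the example tweets are hypothetical, and no stop-word filtering is applied.

```python
import re
from collections import Counter

# A token is an optional '#' followed by word characters.
TOKEN = re.compile(r"#?\w+")


def cooccurrences(tweets):
    """Count how often each hashtag shares a tweet with each word."""
    counts = Counter()
    for text in tweets:
        tokens = set(TOKEN.findall(text.lower()))
        tags = {t[1:] for t in tokens if t.startswith("#")}
        words = {t.lstrip("#") for t in tokens}
        for tag in tags:
            for word in words - {tag}:
                counts[(tag, word)] += 1
    return counts


# Hypothetical example tweets.
tweets = [
    "New #html5 draft mentions rdfa and microdata",
    "#html5 and rdfa again",
    "#semanticweb meets #linkeddata",
]

counts = cooccurrences(tweets)
print(counts[("html5", "rdfa")])              # 2
print(counts[("semanticweb", "linkeddata")])  # 1
```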
In the networks generated by our model where the number of central
nodes is high, we see that users who are not connected in Twitter
by follower or friend relations can be suggested to each other.
This can be implemented as future work of our research.
In summary, we have evaluated our model by computing the
centrality measures of each network. We show that the relations
between people can be extracted by associating the content they
publish. We also show that a community of interest can be
extracted by using the central nodes in the networks. In
addition, we demonstrate that our model is capable of expanding
the keywords which are specific to an interest area without
requiring users to use them. In other words, users are directly
associated with tags, but our model can also find other relevant
keywords that the users themselves may not have used.
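The keyword expansion summarized above can be sketched as a two-step lookup: from a user to the tags the user is associated with, and from each tag to its co-occurring words. In the sketch below, the tag-word frequencies are the ones listed in Table 4, while the user "user_a", the user-tag assignment and the `min_freq` cut-off are hypothetical and serve only as an illustration.

```python
def expand_keywords(user, user_tags, tag_words, min_freq=1):
    """Return the user's tags plus the words co-occurring with those tags."""
    tags = user_tags.get(user, set())
    expanded = set(tags)
    for tag in tags:
        for word, freq in tag_words.get(tag, {}).items():
            if freq >= min_freq:
                expanded.add(word)
    return expanded


# Tag-word co-occurrence frequencies taken from Table 4.
tag_words = {
    "html5": {"rdfa": 78, "microdata": 20},
    "linkeddata": {"data": 55},
    "semanticweb": {"linkeddata": 53, "web": 52, "rdfa": 24},
}

# Hypothetical user-tag association.
user_tags = {"user_a": {"semanticweb"}}

print(sorted(expand_keywords("user_a", user_tags, tag_words)))
# ['linkeddata', 'rdfa', 'semanticweb', 'web']
print(sorted(expand_keywords("user_a", user_tags, tag_words, min_freq=50)))
# ['linkeddata', 'semanticweb', 'web']
```

Here the user only used the tag "semanticweb" but can also be reached through "linkeddata", "web" and "rdfa".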
7. CONCLUSION
In this research, we have proposed and implemented a model to
explore interest area specific communities in microblogging
environments. We have introduced user generated content and its
enabling technologies, which are gathered under the concept of
Web 2.0.
User generated content allows people to share knowledge,
communicate and collaborate via social web applications, which
have gained impressive popularity in recent years. Among the
different kinds of social web applications, such as networking
(Facebook, MySpace), bookmarking (Del.icio.us, Citeulike), sharing
videos and photos (Flickr) and blogging (FriendFeed, Mashable,
ReadWriteWeb), we focus on the microblogging application Twitter.
Twitter differs from social networking sites in that the
relationships between its users are based on their interest areas.
Users follow other users who they think share knowledge in a
specific area. It also differs from other social web applications
where users share resources, such as Flickr, because its structure
does not allow many users to collaboratively tag a resource; the
resource is a short text message, not an image or a bookmark.
However, it is still a major problem to find people in Twitter who
share similar interests and provide valuable content among many
irrelevant and conversational dialogs.
In our research we focus on discovering the interest area of users
by processing and analyzing the content they publish.
Furthermore, we explore relationships between people who share
similar interests by extracting the common keywords and other
related words associated with these words. By implementing our
model, we extract the communities associated with a specific
interest area and expand the tag cloud of the users so that they can
be searched not only by the words they use but also other
keywords relevant to this interest area.
7.1 Contributions
The major contribution of this thesis is to the problem of
discovering users’ interest areas and finding similar users who
provide valuable content and information in a specific area in
online social networks, specifically in microblogging
environments. Our model is distinguished from other approaches
to finding users in these networks because the relations we
discover are based on the users’ contributions instead of the
explicit information given by them, such as location, biography,
friends, followers, number of updates etc. Since our basis for
extracting the relations is the content itself, we avoid finding
people who declare that they are interested in a specific area but
do not provide any valuable content in this area.
In addition, we expand the tag cloud of the users by discovering
associations between keywords, so that even if users do not use a
specific word, they can be found by all relevant words in this
area. Hence, similar people are matched by a set of relevant words
instead of by an exact keyword match.
8. DISCUSSION & FUTURE WORK
We initiated our research to find people who are interested in a
specific area by giving a specific keyword as input. However, due
to time and resource constraints, we narrowed down the scope of
our thesis to finding the interest area of a given user and other
users who share similar interests. In addition, we focused on
discovering:
• the common interests between users
• the users and communities who are similar to a given user in
  terms of a specific interest area.
Throughout this thesis, we identified several directions which we
left as future work for our proposed model due to time and
resource constraints. We describe these alternative directions
below.
Users who are interested in a specific area can be found by
searching for a specific keyword. By associating the relevant
words, tags and users as we propose in our model, users who are
associated with the whole set of relevant words in a specific area
can be found as well. This work, on the other hand, requires
indexing and ranking as search engines do.
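As an illustration of what such indexing and ranking could look like, the sketch below builds an inverted index from words to users and ranks users by the summed weight of the query words they are associated with. The scoring rule, the function names and the data are hypothetical assumptions; a real search engine would use a more elaborate ranking.

```python
from collections import defaultdict


def build_index(user_words):
    """Map each word to the users associated with it and their weights."""
    index = defaultdict(dict)
    for user, words in user_words.items():
        for word, weight in words.items():
            index[word][user] = weight
    return index


def search(index, query_words):
    """Rank users by the summed weight of the query words they match."""
    scores = defaultdict(int)
    for word in query_words:
        for user, weight in index.get(word, {}).items():
            scores[user] += weight
    # Highest score first; ties are broken alphabetically by user name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical word weights per user (e.g. usage or association counts).
user_words = {
    "user_a": {"rdfa": 3, "linkeddata": 2},
    "user_b": {"rdfa": 1},
    "user_c": {"jazz": 4},
}

index = build_index(user_words)
print(search(index, ["rdfa", "linkeddata"]))
# [('user_a', 5), ('user_b', 1)]
```

Passing the expanded set of relevant words as the query, instead of a single keyword, is what lets users be found by words they did not use themselves.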
Another direction would be to implement a semantic reasoning
engine for the set of words we associate so that the meaning of the
interest area would be extracted. It is also possible to cluster the
words according to a classification or categorization algorithm so
that specific thresholds can be discovered and the less relevant
words in the words set could be eliminated.
The set of words we associated can also be mapped to pre-defined
ontologies such as ConceptNet and WordNet to extract the
relationships between them, such as is-a and has-a relations. This
would also add a semantic view to the group of words which we
associated with each other.
Due to resource limitations, we have not implemented path finding
algorithms in the users-tags-words network of our proposed model.
Such an implementation would be an extension to our model and a
basis for a user suggestion application.
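A minimal sketch of such path finding is given below, using breadth-first search over an undirected users-tags-words network. The node naming scheme ("user:", "tag:", "word:" prefixes), the helper names and the edges are hypothetical; the sketch only illustrates how a path through shared tags and words could connect two users who might then be suggested to each other.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path or None."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical users-tags-words network.
graph = make_graph([
    ("user:alice", "tag:semanticweb"),
    ("tag:semanticweb", "word:rdfa"),
    ("word:rdfa", "tag:html5"),
    ("tag:html5", "user:bob"),
    ("user:carol", "tag:jazz"),
])

print(shortest_path(graph, "user:alice", "user:bob"))
# ['user:alice', 'tag:semanticweb', 'word:rdfa', 'tag:html5', 'user:bob']
print(shortest_path(graph, "user:alice", "user:carol"))
# None
```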
In our thesis we applied our model to a set of users who are known
to have reply relations. Given this information, we tried to find
the relations between users based on a specific interest area.
However, applying further social network analysis algorithms such
as clustering would expose the communities and the relations
between these communities based on our network models for
users-tags-words. As future work, our model can be applied to all
the users in a single network to discover communities.
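As a very simple stand-in for a clustering algorithm, the sketch below groups users into connected components of a user-user network. This is an assumption made for illustration only: a proper community detection method would also split densely connected components further. The edge list is hypothetical.

```python
def communities(edges):
    """Group users into connected components of an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical user-user lines that survived thresholding.
edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]
print(communities(edges))
# [['u1', 'u2', 'u3'], ['u4', 'u5']]
```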
Our model requires the processing of all the tweets from all the
users. Hence, it is not realistically feasible to process all the
data and index all the users together with the information
regarding their interest areas. During the evaluation phase of
this thesis, we faced performance problems due to the processing
of large sets of data. Therefore, alternative ways to overcome
the performance problems at large scale should be considered.
9. ACKNOWLEDGEMENTS
This work is partially funded by B.U. Research Fund
grant BAP 09HA102P.
10. REFERENCES
[1] O'Reilly, T., “What is Web 2.0”, Web 2.0 Conference, San
Francisco, CA, USA, 2004.
[2] Twitter, “Twitter Search”, http://search.twitter.com/
[3] Twitter, “Suggested Users List”,
http://blog.twitter.com/2009/03/suggested-users.html, 2009.
[4] Twitter, “Twitter”, http://twitter.com/
[5] Zarrella, D., “Modeling ReTweet Dynamics”,
http://danzarrella.com/modeling-retweet-dynamics.html,
2009.
[6] Huberman, B. A. and Scott A. Golder, “The Structure of
Collaborative Tagging Systems”, HP Labs technical report,
2005.
[7] Tumblr, http://www.tumblr.com/.
[8] Wefollow, http://wefollow.com/, 2009.
[9] Schruz, F.D., “Glossary of Terms Used in Database
Searching”,
http://library.iusb.edu/instruction/helpguide/handouts/DatabaseSearching.shtml,
2009.
[10] Sun Microsystems, “Java Technology Reference”,
http://java.sun.com/reference/index.jsp, 2009.
[11] IBM Research, “Eclipse IDE - an open extensible Integrated
Development Environment (IDE)”, http://www.eclipse.org/,
2009.
[12] Yamamoto, Y., “Twitter4J - An open-sourced, mavenized
and Google App Engine safe Java library for the Twitter API,
released under the BSD license”,
http://yusuke.homeip.net/twitter4j/en/index.html, 2009.
[13] C. Cattuto, D. Benz, A. Hotho, G. Stumme, Semantic
Analysis of Tag Similarity Measures in Collaborative
Tagging Systems, 3rd Workshop on Ontology Learning and
Population OLP3, 2008
[14] Wettler, M. and R. Rapp, “Computation of Word
Associations Based on the Co-Occurrences of Words in
Large Corpora”, International Conference On Computational
Linguistics, Taipei, Taiwan, 2002
[15] Man Au Yeung, Ching and Noll, Michael and Gibbins,
Nicholas and Meinel, Christoph and Shadbolt,
Nigel (2009) On Measuring Expertise in Collaborative
Tagging Systems. In: Proceedings of the WebSci'09: Society
On-Line, 18-20 March 2009, Athens, Greece
[16] Twitterholic, “Top Twitter User Rankings & Stats”,
http://twitterholic.com, 2010
[17] Google, “Google Search Engine”, http://www.google.com/,
2010.