Exploring Area-Specific Microblogging Social Networks

Ece Aksu Degirmencioglu, Bogazici University, 34342 Bebek, Istanbul, Turkey, eceaksu@gmail.com
Suzan Uskudarli, Bogazici University, 34342 Bebek, Istanbul, Turkey, uskudarli@cmpe.boun.edu.tr

ABSTRACT
Social networks can be used to find people who share similar interests or people who have knowledge in a specific domain. One method to find people is to search by specific information about them or by specific keywords they use. This method is limited to the explicit information provided by people. The other method ranks people by their popularity to suggest a list of popular people for a given interest area. However, this method hides people who provide valuable content but are not so popular in the network. In this paper, we examine what people contribute rather than what they declare about themselves. The idea is that their value depends on what they contribute. We propose a method for identifying the interest areas of people without relying on the explicit information declared by them. Moreover, by applying our model, we identify interest-specific microblogger communities and extract sets of relevant words related to different interest areas.

Keywords
Interest networks, user suggestion, Twitter, keyword co-occurrence, social network analysis, community extraction

1. INTRODUCTION
Social applications on the Internet today allow people to interactively share their knowledge and resources such as video, text and images; to collaboratively perform tasks such as updating documents; to find friends; and to communicate. The popularity of these applications in online social networks increased after the advent of Web 2.0 [1] technologies due to user generated content. Among the different types of social applications, microblogging environments became the fastest growing type with the introduction of Twitter in 2007. With over 14 million users, Twitter's growth was reported as 1392 percent in 2009.
Microblogging provides a simple, short, efficient and quick way of spreading and retrieving information. It allows users to expose their ideas, feelings, interests, knowledge and expertise by means of short text messages. Users also interact through microblogs. To express the desired content in a space-efficient manner, various space-conserving conventions and notations have emerged. For example, in Twitter, hashtags are tokens that start with a hash symbol (#) prefix. They are used to tag the microblogs (tweets) they occur in.

A microblogger's contributions:
- may relate to numerous topics
- may be thoughts fragmented across many microblogs
- may duplicate somebody else's content
- may contain references to other users, external links, tags etc.

The nature of microblogs causes two major problems. First, it is very difficult to distinguish valuable content among all the contributions. Second, it is hard to find users worth following, that is, users who contribute actively and contribute valuable information.

Search tools for content and users are available for microblogging environments, and specifically for Twitter. They follow one of two approaches. The first approach is keyword matching: people search for specific keywords to find information which contains them. Keyword matching is used to find either content [2] or users [3]. When content is searched by keywords, thousands of contributions from many different users appear in the result set, which makes it very hard to filter out the irrelevant content. When users are searched by keywords, the result set depends on the explicit information declared by the users themselves. However, when people's declarations and their contributions are not aligned, queries by specific keywords return misleading results.

The other approach is popularity based ranking.
With this approach, the discovery of users who share relevant and valuable information is based on the popularity of users, measured by quantitative variables such as the number of followers or contributions. Search tools of this kind leave the less popular but more valuable users hidden among millions of other users.

In this study, we focus on finding users who are interested in a specific area by processing their microblog contributions. We also focus on finding the relations between users who build up a community around similar interests. This work aims to:
(1) identify a microblogger's interest given their contributions
(2) identify a community of interest given a microblogger
(3) examine the social network properties of microbloggers' communities of interest

It proposes, first, to process a collection of microblog contributions and reduce them to a set of keywords representing the nature of their content and, second, to identify a community of interest based on a user. In the following sections, the proposed model is described in detail and evaluated based on data collected from Twitter.

2. BACKGROUND
2.1 Twitter
Twitter [4] is a free social networking and microblogging service which allows users to publish, share and retrieve content. Twitter also supports video and image formats in addition to text. In Twitter, short messages are called tweets and each tweet is limited to 140 characters. Due to the shortness of tweets, they contain abbreviations, short URLs, user names and special notations such as words prefixed with '#' or '@' characters. The real content is usually given in links external to Twitter. In some cases, the content may also be fragmented over several tweets.

Users follow other users' tweets, which means that all the contributions from the users they follow are listed in a single stream. Other than publishing tweets, users can follow other users in Twitter.
Users are in the follower role when they follow another user, and are listed in the followers list of the user they follow. The users who are being followed are called friends. Tweets are broadcast in one direction in Twitter, which means that the followers of a user are able to see the tweets published by that user. However, any other user who comes across these tweets in a search is also able to see them, as long as the security and privacy settings are set to public rather than private.

A reply is a special message sent from one user to another. It is distinguished from a normal tweet by the at sign (@) prefix of user names. If a tweet begins with @username, it is a reply. If the tweet contains @username but not at the beginning, it is considered a mention. Twitter displays the tweets which contain @username on that user's home page. There is no requirement for users to be following other users in order to see the replies or mentions addressed to them. While replies and mentions are broadcast publicly, including to users for whom they are not intended, Twitter also allows users to send private messages, named direct messages, from one person to another. Direct messages can only be sent to followers. Direct messages are out of scope in this study since they are not publicly retrievable.

A retweet (ReTweet) is a re-post of something posted by another user, usually preceded by "RT" and "@username" to refer to the original poster. Retweets are used very frequently in Twitter and have a dramatic influence on users [5].

Tags can be considered a bottom-up approach to classification, compared to taxonomies, which are defined hierarchically by experts for a limited set of items with a top-down approach. In a taxonomy, there is one way to classify each item. With tags, however, items can be classified in many different ways since tagging has a flat structure [6].
A special type of tag is the hashtag, which is used in microblogging systems such as Twitter and Tumblr [7]. Hashtags are words or phrases prefixed with a hash sign (#), with multi-word phrases concatenated. Throughout this study we refer to hashtags and tags interchangeably.

Users in Twitter are represented by a unique screen name. They optionally enter profile information such as name, biographic information, location and web site address.

In order to organize the users they follow, users create lists in Twitter. By means of lists, people are grouped by specific subjects or interest areas. However, membership in a list does not always show that a user publishes valuable content on the subject matter of the list. On the other hand, since lists are optional, users who are not in any list may be producing more valuable content than the users we retrieve via lists. In our study, we keep all explicit categorizations given by users out of scope and focus on the content to explore the communities or user groups that come together around a specific area of interest. For this reason, we do not use lists as a parameter to discover relationships based on interests.

Copyright is held by the authors. Web Science Conf. 2010, April 26-27, 2010, Raleigh, NC, USA.

3. RELATED WORK
Social networks can be used to find people who share similar interests or people who have knowledge in a specific domain. Using social networks to share knowledge is a very efficient way of reaching information or simply finding answers to questions. There are two types of methods to discover people in microblogging environments. One method is based on search by specific information about people or by specific keywords they use [2][17]. It is based on keyword search and limited to information explicitly declared by the users, such as name, location, marital status, interests etc.
Since users often do not declare their interest areas and other information which may help in finding them, it is usually a time-consuming task to locate those who are of interest. The other method is based on ranking people by their popularity to suggest a list of popular people for a given interest area [3][8][16]. These people suggestion tools rank people by their popularity, measured by the number of other people following them. However, popularity based methods make popular people more popular while keeping the valuable but less popular ones hidden. In addition, popularity based ranking methods make it impossible to find the right people when the goal is to interact, communicate or simply ask questions to retrieve specific information.

4. PROPOSED MODEL
We propose a method for identifying users' interest areas and the communities based on these interests. Our approach aims to identify a microblogger's interests by processing their contributions. It is assumed that, as long as users actively publish microblogs aligned with their interest areas, it is possible to identify these interest areas by processing their contributions. In addition, hashtags are considered key elements of microblogs which help to distinguish specific areas of interest among conversational content.

In our method, the interest areas of users are identified by extracting the hashtags they use in their microblogs. A tag cloud is a representation of all the tags assigned by a user in a form in which the most frequent tags can be distinguished in a single view. The tag clouds of users are enriched by associating hashtags and non-hashtag words based on their co-occurrence in the same microblog. The important points here are:
- Once a word has been found to be used as a hashtag in any microblog from any user in the dataset, it does not need to appear in hashtag form in other microblogs, from any user, to be counted when investigating the co-occurrence relations.
- Enriching the tag cloud of a user is based on all microblogs from all users, which means that, while investigating the co-occurrence relations, all hashtags and non-hashtags from all the users in the dataset are considered.

Once the hashtags are associated with the users, the users who use common hashtags are also associated in order to extract communities of interest.

Our method consists of two steps in general: (1) Processing Microblogs and (2) Network Analysis. These steps are shown in Figure 1 with the sub-steps of each.

[Figure 1: flow diagram from an input user name through Pre-Processing (retrieve tweets from Twitter, parse tweets into words, words categorization into stop words, links, tags, emotion words, general words, time related words and mentions & replies, punctuation removal, elimination) and Network Generation (retrieve all the words used as tag at least once, find tag-word co-occurrences and their frequencies, find users-tags relations and user-tag counts, generate words network, generate users-words network) to the words and tags of interest areas 1..N and communities 1..N.]
Figure 1. Proposed Model

User text contributions are processed in the following manner:
- Any set of users is selected, such as:
  o a set of randomly selected users
  o a set of users in a list
  o a set of users who are already known to have conversations
- All microblogs published by these selected users in a microblogging environment are retrieved.
- Microblogs are parsed into tokens.
- Tokens are categorized into the categories defined in Section 4.1 (Processing Microblogs).
- Tokens in the categories listed under Elimination in Section 4.1 are filtered out.
- The interest areas of users are defined by associating the users with the hashtags they use.
- The tag clouds of users are enriched by associating hashtags with the other words they co-occur with in the same tweet.
- Interest based user networks are generated by associating users who use common hashtags in their microblogs.
These steps are explained in detail in the next sections.

4.1 Processing Microblogs
Microblogs contain words or phrases which are relevant keywords for the interest areas of the user who publishes them. However, microblogs also contain other words which are irrelevant to user interests. Examples of such words are 'I think', 'thanks', 'great' etc. In addition to words which are irrelevant to user interests, punctuation marks are also frequently used in microblogs.

In our proposed model, we focus on discovering the words which are candidate keywords for the interest areas of the users who post the content. Hence, we aim to have a precise set of relevant words which gives us clues to explore interest areas and the communities attached to them. To obtain such a set, tokens are categorized first, then irrelevant categories are filtered out, as described in the following paragraphs of this section.

The processing algorithm performs the following steps:

Parsing: All microblogging contributions are parsed, resulting in a bag of tokens.

Tokens Categorization: The tokens are categorized into the following categories in this step. By means of categorization, it is possible to eliminate the irrelevant ones in the next steps.

Stop Words: Stop words are words that are either insignificant, such as prepositions, articles etc., or so common that they cannot be related to any specific interest area [9]. They vary from language to language and system to system. Here, we refer to the stop words defined for English. Tokens are categorized as stop words so that they can be eliminated in the following steps of our algorithm, since they do not contain relevant or significant information for discovering the interest areas of users.

Links: Microblogs are short messages. Due to the character limitation of microblogging systems, microblogs contain links through which users can navigate to web sites external to the microblogging system. By means of links, details of the information given in the microblog can be retrieved. In our model, we categorize all tokens starting with the character sequence "http" as links.

Hashtags: Hashtags are words in microblogs with a hash sign "#" prefix. Due to the character limitation of microblogs, hashtags are used to allow users to give brief information about the content of the links, pictures or simply the text they publish. Hashtags are also used to search for what users have posted previously. In our model, hashtags are associated with users to identify their interest areas. It is assumed that if a word is used as a hashtag in a microblog, it is more valuable than the other words used in the microblogs in terms of relevancy to the interest areas of users.

Mentions & Replies: Tokens are categorized as mentions if they have an at sign "@" prefix and are not the first token in the microblog.
Example: Leaving house for first time since Xmas eve. Excuse to wear red duffle coat from sister and sparkly scarf from @evwa!
@evwa is a user who is mentioned in this tweet.
Tokens are categorized as replies if they have an at sign "@" prefix and are the first token in the microblog.
Example: @delta_goodrem Hope u had a Beautiful xmas Delta!
@delta_goodrem is a user who is replied to in this tweet.
In our method, all tokens beginning with the "@" character are categorized as either mentions or replies. However, it cannot be certain that such a token is really a reply or a mention, since there is no specific rule or restriction on how users use this notation in microblogs. It can only be said that such tokens are user names.
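As a rough illustration of the categorization rules for links, hashtags, mentions and replies described above, the following Python sketch labels the tokens of a microblog. It is not the implementation used in this study, which was written in Java; the function name `categorize`, the category labels and the very small `STOP_WORDS` sample are assumptions made only for this example.

```python
# Illustrative sketch only. STOP_WORDS is a tiny assumed sample, not a real
# English stop-word list.
STOP_WORDS = {"a", "an", "the", "to", "from", "and", "for", "of", "in"}


def categorize(microblog):
    """Split a microblog on whitespace and label each token with a category."""
    labeled = []
    for position, token in enumerate(microblog.split()):
        if token.startswith("http"):
            category = "link"
        elif token.startswith("#"):
            category = "hashtag"
        elif token.startswith("@"):
            # An @-token in first position is a reply, anywhere else a mention.
            category = "reply" if position == 0 else "mention"
        elif token.lower() in STOP_WORDS:
            category = "stop_word"
        else:
            category = "word"
        labeled.append((token, category))
    return labeled
```

Tokens are labeled before any punctuation is stripped, so that the "#", "@" and "http" prefixes are still available when the category is decided.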
In our algorithm, all the mentions and replies are eliminated since the focus is on the keywords relevant to interest areas.

General Words: Based on the dynamics of microblogging systems, users commonly refer to the microblogging application itself as well. Such common words are specified in our model and eliminated. Other common words which are used frequently but not treated as stop words are also categorized as general words. The selected words in this category are:
General Words = {"twitter", "RT", "RT@", "tweet", "free", "check"}

Emotion Words: When people post microblogs, they add their feelings as well. Tokens which express the feelings of users are categorized as emotion words. Emotion words are eliminated in our model.
Emotion Words = {"good", "big", "awesome", "amazing", "fine", "nice", "bad", "beautiful", "love", "enjoy", "hate"}

Time Related Words: Tokens describing the time of the events or information given are frequently used in microblogs. These tokens are not in our focus since they are common words which are not specifically related to any interest area. Though there are many time related words in the English language, only the most common ones are defined as time related words in our model:
Time Related Words = {"today", "tomorrow", "now", "2009"}

Punctuation Removal: All non-alphanumeric characters are removed from the tokens extracted in the parsing step (e.g. "favorite!!" becomes "favorite"). Alphanumeric characters are the digits 0 to 9 and the letters A to Z; all other characters are non-alphanumeric. Punctuation removal is done after the categorization of the tokens, because the categorization is based on the special notation in microblogs which helps to identify the types of the tokens (hashtags, mentions, links etc.).

Elimination: The last step of processing microblogs is elimination, where tokens irrelevant to any interest area are filtered out to obtain a concise list of words which are expected to be more relevant to specific interest areas. This concise set of words is used in the next steps of the model to explore the interest areas of users and the associations between users based on these interest areas. The tokens in the following categories are filtered out in this step: links, stop words, emotion words, general words, time related words, mentions and replies. The final set contains only tag and non-tag words.

In the following sections, we explain how the interest areas of users are identified by associating users with the hashtags they use in their microblogs. Then we explain how other words relevant to a hashtag are associated by using the method of co-occurrence. Finally, we explain how we define associations between users based on the common interests they have.

4.2 Network Analysis
In the second step of our model, we have three main objectives: to discover a set of words relevant to an interest area, to discover the interest areas of users, and to generate interest based communities. Relations among users and their interest areas are represented in the form of different types of network graphs. In the following sections, these networks are explained in detail.

4.2.1 Relevant Words Network
This network aims to present the set of relevant words specific to an interest area. The relations between words allow us:
- to discover and categorize the set of words which belong to a specific interest area
- to navigate through related interest areas
- to enrich the tag cloud of each user, so that users can be found through words which they do not use themselves but which are relevant to their interest areas
A sample representation of this type of network is shown in Figure 2.

Figure 2. Sample Network of Relevant Words

It has two types of nodes:
T = {t1, t2, t3, ..., tn}: a set of hashtags
W = {w1, w2, w3, ..., wm}: a set of other words which are not hashtags
A word may be used as a hashtag in one microblog but not as a hashtag in another. In this case, the word is counted as an element of the first set, T.

If a hashtag t in T co-occurs with a word w in W in at least one microblog, there is an edge between the hashtag t and the word w. The weight of the edge is the number of microblogs in which they co-occur. Edge weights directly give us the co-occurrence relatedness of words.
E = {(t, w) | t ∈ T and w ∈ W ∧ t ≠ w}
Weight(e) = the number of microblogs in which t and w co-occur, for e = (t, w) ∈ E

Figure 3. Sample Network of Users-Interest [figure: node labels include Jane, Mike and John]
The users-interest areas network of Figure 3 (Section 4.2.2) has two types of nodes:
U = {u1, u2, u3, ..., um}: a set of users who use at least one hashtag in at least one of their microblogs
T = {t1, t2, t3, ..., tn}: a set of hashtags used by any of the users in U at least once

To illustrate how the weights of the relevant words network are calculated, a basic example is given here:
Microblog1 = "Reading papers about OMG's #MDA to see what work has been done on PIM->PSM transformation."
Microblog2 = "MDA and OMG are related concepts"
In the first microblog, "mda" is a hashtag. After the hashtag "mda" is identified as an element of T, all microblogs containing "mda", either as a hashtag or as a non-hashtag, are retrieved. For this reason, the second microblog, where "mda" is not a hashtag, is counted as well, and the co-occurrence frequency of the words "MDA" and "OMG" is two, since they co-occur in both microblogs.

The set W does not contain any word which is in the set T, in order to avoid a connection between the same word used as a hashtag in one microblog and not as a hashtag in another:
T ∩ W = {}

Based on the co-occurrence of two words in the same microblog, we generate a network which shows the relevancy between all the tokens that are output by the elimination step of processing microblogs.
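The co-occurrence weighting of the relevant words network can be made concrete with a short Python sketch that reproduces the MDA/OMG example. This is a minimal illustration under stated assumptions, not the code used in this study: the names `tokens`, `hashtags`, `relevant_words_network` and the `ignored` parameter (standing in for the eliminated categories) are hypothetical, and tokenizing by splitting on non-alphanumeric characters, so that "OMG's" yields "omg", is an assumption made so that the example matches the count given in the text.

```python
import re
from collections import Counter

EXAMPLE_MICROBLOGS = [
    "Reading papers about OMG's #MDA to see what work has been done "
    "on PIM->PSM transformation.",
    "MDA and OMG are related concepts",
]


def tokens(microblog):
    """Distinct lowercased alphanumeric tokens; '#' and punctuation are dropped."""
    return set(re.findall(r"[a-z0-9]+", microblog.lower()))


def hashtags(microblog):
    """Lowercased words that carry a '#' prefix in this microblog."""
    return {tag.lower() for tag in re.findall(r"#([A-Za-z0-9]+)", microblog)}


def relevant_words_network(microblogs, ignored=frozenset()):
    """Return (T, weights), where weights[(t, w)] counts co-occurring microblogs."""
    # T: every word used as a hashtag at least once, in any microblog.
    T = set()
    for microblog in microblogs:
        T |= hashtags(microblog)
    weights = Counter()
    for microblog in microblogs:
        words = tokens(microblog) - set(ignored)
        # A word in T counts as a hub even where it appears without '#'.
        for t in words & T:
            for w in words - T:  # W excludes T, so T ∩ W = {}
                weights[(t, w)] += 1
    return T, weights
```

Calling `relevant_words_network(EXAMPLE_MICROBLOGS)` yields T = {"mda"} and a weight of 2 for the edge ("mda", "omg"), because "mda" is treated as a hub in the second microblog even though it is not written as a hashtag there.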
Hashtags are different from non-hashtag words since they give precise information about the content, and hence about the interest areas. Instead of trying to find all co-occurrences of all words in all contributions, hashtags are considered as hubs, and words which co-occur with hashtags are associated with these hubs. By doing this, we filter out irrelevant words such as conversational ones, and assume that the words which co-occur with hashtags are possible candidate keywords for interest areas even if they are not used as hashtags in the microblogs.

Based on the approach explained above, the following steps are executed to generate the relevant words network:
Step 1: Retrieve all the words used as a hashtag at least once
Step 2: Find all the words which co-occur at least once with the words found in the first step
Step 3: Find the frequency of each co-occurrence
Step 4: Generate the words network

The relevant words network identifies the interest areas in a given set of microblogs. Once the interests are extracted, the next step is to associate these interest areas with the users. In the following section, we describe in detail how users and their interest areas are associated.

4.2.2 Users – Interest Areas Network
This network aims to present the relations between users and their interest areas. Users are associated with the hashtags they use at least once in any of their microblogs. A sample representation of this type of network is shown in Figure 3.

The assumption here is that hashtags are key elements for discovering the interest areas of users. Users who do not use any hashtags in their microblogs cannot be associated with any specific interest area by our model.

If a hashtag t ∈ T occurs at least once in the microblogs posted by the user u ∈ U, there is an edge between the hashtag t and the user u. In order to evaluate the strength of the edges, and hence the relation between a user u and the hashtag t, we assign weights w(eu) to the edges.
EU = {(u, t) | t ∈ T and u ∈ U}
Weight(eu) = [expression missing in the source]

This network is used to generate user networks based on the hashtags users have in common. In the next section we give the formal definition of the interest based user networks.

4.2.3 Interest Based User Networks
This network aims to present the set of users who are connected based on an interest area. The relations between users allow us to:
- discover communities based on specific areas of interest
- navigate through users who are connected based on an interest
A sample representation of this type of network is shown in Figure 4.

It has one type of node:
U = {u1, u2, u3, ..., um}: a set of users who use at least one hashtag in at least one of their microblogs
The set of hashtags which are used in common by at least two users is also defined as T = {t1, t2, t3, ..., tn} in this network. However, hashtags are not presented in the network graph as another type of node.

The assumption here is that users who have common interests tend to use the same hashtags. The more two users use the same hashtags in their microblogs, the more likely they are to share similar interest areas.

We define interest based user networks as weighted undirected graphs. Users who use a hashtag t in T at least once are represented as u in U. There is an undirected edge e in E between two users if they use the same tag at least once. In order to evaluate the strength of the edges, and hence the relation between two users, we assign weights w(e) to the edges. The number of common tags two users use in their tweets gives the weight of the edge between them.
E = {(un, um) | un ∈ U and um ∈ U}
Weight(e) = the number of tags that both un and um use, for e = (un, um) ∈ E

5. IMPLEMENTATION
The model proposed in this study has been implemented in the Java programming language on the Sun Java 2 Platform version 5 [10]. We used the Eclipse IDE [11] to edit, compile and debug our source code. Twitter exposes its data via its API. In order to invoke the API methods from our Java implementation, we use an open source Java library for the Twitter API, Twitter4j [12].

5.1 Data Set
In our thesis, we focus on users who publish content relevant to a specific interest area. With random user selection, there is a high probability of retrieving users who publish conversational content, given the nature of Twitter, where most of the content is conversational. Our alternative approach is to select a set of users who we already know to be related in terms of replies, follower or friends lists, retweets or mentions. By selecting users who are known to be related, we can apply our method, and if it recovers these relations implicitly from the content the users publish, we can verify that our model is promising.

We assume that users who have conversational relationships have the potential to be:
- Human users: not bots or spammers
- Active users: users that publish content instead of solely reading it
These two criteria are important for us since we focus on exploring relations between human users, and on users who publish sufficient content in a specific area for us to analyze. In our model, the more content provided by a user, the more we can explore their interest areas and relations in terms of these interests.

Accordingly, we selected 50 users from the Wefollow [8] web site, where users are categorized into specific areas and ranked by their popularity. Then, for each of these 50 users, we ranked the users they reply to, selected the 20 most replied ones and collected their tweets. The size of our data is given in the table below:

Table 1. Data Set
# of users   # of tweets   # of all words   # of stop words   # of words (concise list)
802          1,752,300     26,356,735       15,254,743        11,101,992

6. EVALUATION
During the evaluation process, we created an interest based user network for each set of users by using the tweets they publish, and analyzed the basic network properties of each. We compare these networks below in terms of the centrality, betweenness and degree of their nodes. Sample results are displayed in Table 2.

Table 2. Evaluation of Data
Network   Avr. Centrality   Avr. Degree   Avr. Closeness   Avr. Betweenness
User1     4                 0.21324       0.28028          0.02021
User2     11                0.05882       0.09635          0.00171
User3     1                 0.38462       0.43719          0.14204

We evaluate the networks based on the centrality measures of social network analysis: closeness, betweenness and degree. In addition, we extract the central nodes of each network. In our case, higher centrality values of the nodes imply that a single user has common interests with many of the other users and that the other users do not have common interests. By measuring these metrics we aim to:
- compare networks based on the same criteria
- see whether users are connected based on the content they share (proof of our model)
- evaluate the relation between user behaviors and their network properties
- find the central nodes and interest based clusters of users

The initial result of our evaluation is that our model finds the relations between users based on the content they share, without knowing any relation information between the users. It connects the users who share similar content and isolates the users who do not have any content in common with any of the users in the network. A threshold can be set to isolate users. We set the threshold to five and therefore removed the edges with a value lower than five in the sample network. A sample network for the user "cforbesoklahoma" is shown in the figure below.

Figure 5. Connected and Isolated Users

While we capture the center nodes of each network based on their common interests, the original users of each network are not necessarily at the center of the networks. This means that even though we selected our data based on the reply relations of a specific user, our model generates a network where user connections are completely based on the content they share. Central nodes extracted from the network of the user "appstoresocial" are shown in the figure below, where the input user "appstoresocial" is not one of the central nodes.

Figure 6. All nodes for "appstoresocial" user network
Figure 7. Central Nodes of The "Appstoresocial" Network

The analysis of the networks also shows us that lower values of betweenness, degree and closeness in a network cause the number of central nodes to increase. A comparison of these two types of networks is given with a sample below:

Table 3. Network Measures Comparison for Two Sample Networks
Network          Centers   Avr. Degree   Avr. Closeness   Avr. Betweenness   User in the center?   Nodes
SunnaGunnlaugs   16        0.01754       0.03098          0.00023            Yes                   20
Ivan_Herman      3         0.13235       0.20827          0.00832            No                    18

Figure 8. Central Nodes in Ivan_Herman Network
Figure 9. Central Nodes in SunnaGunnlaugs network

In order to see the interest areas of these central users, we queried the words network we have generated for each user. A sample that displays the list of tags and the associated words for the users "menuspony" and "kendall" is shown in Table 4. In this case we expect the users in the network to be related to the semantic web area, since the user "ivan_herman" is in reality an academic who provides content relevant to the semantic web. Hence, when we generate our words-users-tags network for the user "ivan_herman", the tags he uses which are relevant to the semantic web area should be extracted easily.
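The interest based user networks analyzed in this section, including the threshold of five applied to the sample network, could be built along the following lines. This Python sketch is only an illustration under stated assumptions: the function names, the example users and their hashtag sets are hypothetical, and the degree centrality shown is the standard normalized form (degree divided by the number of other users), which may differ from the exact measure used to produce the tables.

```python
from itertools import combinations


def interest_based_user_network(user_tags, threshold=1):
    """Undirected weighted edges; weight = number of hashtags two users share.

    Edges whose weight is lower than `threshold` are dropped.
    """
    edges = {}
    for a, b in combinations(sorted(user_tags), 2):
        common = len(user_tags[a] & user_tags[b])
        if common >= threshold:
            edges[(a, b)] = common
    return edges


def degree_centrality(users, edges):
    """Normalized degree: fraction of the other users each user is linked to."""
    degree = {user: 0 for user in users}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    others = len(degree) - 1
    return {u: (d / others if others > 0 else 0.0) for u, d in degree.items()}


# Hypothetical users and hashtag sets, for illustration only.
EXAMPLE_USER_TAGS = {
    "alice": {"rdfa", "html5", "semanticweb", "linkeddata", "sparql", "owl"},
    "bob": {"rdfa", "html5", "semanticweb", "linkeddata", "sparql", "jazz"},
    "carol": {"jazz", "piano"},
}
```

With a threshold of five, only the edge between the hypothetical users "alice" and "bob" survives (they share five hashtags), and "carol" becomes an isolated node, which mirrors how the thresholded sample network separates connected users from isolated ones.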
In addition, since we associate tags with users and tags with words, the overall network contains users relevant to the semantic web area and words relevant to the semantic web concept. From this perspective, the results produced by our model on our test data are as we expected.

Table 4. Sample Co-occurrence Results

Tag           Co-Occurring Word   Co-Occ. Freq.
Html5         Rdfa                78
Linkeddata    Data                55
Semanticweb   Linkeddata          53
Semanticweb   Web                 52
Semanticweb   Rdfa                24
Html5         Microdata           20

In the networks generated by our model in which the number of central nodes is high, we see that users who are not connected in Twitter by follower or friend relations can be suggested to each other. This can be implemented as future work of our research.

In summary, we have evaluated our model by measuring the centrality of each network. We show that the relations between people can be extracted by associating the content they publish. We also show that a community of interest can be extracted by using the central nodes in the networks. In addition, we demonstrate that our model is capable of expanding the keywords that are specific to an interest area without requiring users to use them. In other words, users are associated with the tags directly, but our model can easily find other relevant keywords that the users may not have used.

7. CONCLUSION

In this research, we have proposed and implemented a model to explore interest-area-specific communities in microblogging environments. We have introduced user generated content and its enabling technologies, which are gathered under the concept of Web 2.0. User generated content allows people to share knowledge, communicate and collaborate via social web applications, which have gained impressive popularity in recent years.
Among different kinds of social web applications, such as networking (Facebook, MySpace), bookmarking (Del.icio.us, Citeulike), sharing videos and photos (Flickr) and blogging (FriendFeed, Mashable, ReadWriteWeb), we focus on the microblogging application Twitter. Twitter differs from social networking sites in that the relationships between its users are based on their interest areas: users follow other users who they think share knowledge in a specific area. It also differs from social web applications in which users share resources, such as Flickr, because its structure does not allow many users to collaboratively tag a resource; the resource is a short text message, not an image or a bookmark. However, it is difficult today to find people in Twitter who share similar interests and provide valuable content among the many irrelevant and conversational dialogs.

In our research we focus on discovering the interest area of users by processing and analyzing the content they publish. Furthermore, we explore relationships between people who share similar interests by extracting the common keywords and the other related words associated with them. By implementing our model, we extract the communities associated with a specific interest area and expand the tag cloud of the users, so that they can be searched not only by the words they use but also by other keywords relevant to this interest area.

7.1 Contributions

The major contribution of this thesis addresses the problem of discovering users’ interest areas and finding similar users who provide valuable content and information in a specific area in online social networks, specifically in microblogging environments. Our model is distinguished from other approaches to finding users in these networks because the relations we discover are based on the users’ contributions instead of the explicit information they give, such as location, biography, friends, followers and number of updates.
Since our basis for extracting the relations is the content itself, we avoid finding people who declare that they are interested in a specific area but do not provide any valuable content in it. In addition, we expand the tag cloud of the users by discovering associations between keywords, so that even if users do not use a specific word, they can be found through all the relevant words in the area. Hence, similar people are matched by a set of relevant words instead of an exact keyword match.

8. DISCUSSION & FUTURE WORK

We initiated our research to find people who are interested in a specific area, given a specific keyword as input. However, due to time and resource constraints, we narrowed the scope of our thesis to finding the interest area of a given user and the other users who share similar interests. In addition, we focused on discovering:

- the common interests between users
- the users and communities who are similar to a given user in terms of a specific interest area

Throughout this thesis, we identified several directions that we left as future work for our proposed model due to time and resource constraints. Here we explain these alternative directions in detail.

Users who are interested in a specific area can be found by searching for a specific keyword. By associating the relevant words, tags and users as we propose in our model, users who are associated with the set of relevant words in a specific area can be found. This, on the other hand, requires indexing and ranking, as search engines do. Another direction would be to implement a semantic reasoning engine for the set of words we associate, so that the meaning of the interest area can be extracted. It is also possible to cluster the words with a classification or categorization algorithm, so that specific thresholds can be discovered and the less relevant words in the word set can be eliminated.
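The keyword expansion and keyword-based user search discussed above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: tags are assumed to be hashtags, every other whitespace-separated token of a tweet is treated as a word, punctuation and stop words are not handled, and the fixed `min_freq` cutoff is a simple stand-in for thresholds that could instead be discovered by clustering. All function names and sample data are hypothetical.

```python
from collections import Counter


def count_cooccurrences(tweets):
    """Count how often each (tag, word) pair appears in the same tweet."""
    counts = Counter()
    for text in tweets:
        tokens = text.lower().split()
        tags = {t[1:] for t in tokens if t.startswith("#") and len(t) > 1}
        words = {t for t in tokens if not t.startswith("#")}
        for tag in tags:
            for word in words:
                counts[(tag, word)] += 1
    return counts


def related_words(counts, tag, min_freq=2):
    """Words that co-occur with `tag` at least `min_freq` times, most frequent first."""
    ranked = [(w, f) for (t, w), f in counts.items() if t == tag and f >= min_freq]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def find_users(user_tags, counts, query, min_freq=2):
    """Users whose tags equal the query or co-occur with it often enough.

    `user_tags` maps a user name to the set of tags that user has used.
    """
    query = query.lower()
    matching_tags = {query} | {
        t for (t, w), f in counts.items() if w == query and f >= min_freq
    }
    return sorted(u for u, tags in user_tags.items() if tags & matching_tags)


if __name__ == "__main__":
    # Hypothetical tweets and user-to-tag associations.
    tweets = [
        "rdfa and microdata in #html5",
        "#html5 supports rdfa",
        "#semanticweb linkeddata web",
        "linkeddata on the web #semanticweb",
    ]
    user_tags = {"user_a": {"html5"}, "user_b": {"semanticweb"}}
    counts = count_cooccurrences(tweets)
    print(related_words(counts, "semanticweb"))
    print(find_users(user_tags, counts, "rdfa"))
```

In this sketch a user can be returned for a query word that is associated only with one of the user's tags, which corresponds to matching people by a set of relevant words instead of an exact keyword match. Ranking and indexing of the results are left out.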
The set of words we associated can also be mapped to pre-defined ontologies such as ConceptNet and WordNet to extract the relationships between them, such as is-a and has-a relations. This would also add a semantic view to the group of words that we associated with each other.

Due to resource limitations, we have not implemented path finding algorithms in the users-tags-words network of our proposed model. Such an implementation would be an extension to our model and a basis for a user suggestion application.

In our thesis we applied our model to a set of users that we know have reply relations. With this information in hand, we tried to find the relations between users based on a specific interest area. However, applying further social analysis algorithms, such as clustering, would expose the communities and the relations between these communities based on our network models for users-tags-words. As future work, our model can be applied to all the users in a single network to discover communities.

Our model requires processing all the tweets from all the users. Hence, realistically, it is not feasible to process all the data and index all the users together with the information regarding their interest areas. During the evaluation phase of this thesis, we faced performance problems due to the processing of large data sets. Therefore, alternative ways of overcoming performance problems at large scale should be considered.

9. ACKNOWLEDGEMENTS

This work is partially funded by B.U. Research Fund grant BAP 09HA102P.

10. REFERENCES

[1] O'Reilly, T., “What is Web 2.0”, Web 2.0 Conference, San Francisco, CA, USA, 2004.
[2] Twitter, “Twitter Search”, http://search.twitter.com/
[3] Twitter Suggested Users List, http://blog.twitter.com/2009/03/suggested-users.html, 2009.
[4] Twitter, “Twitter”, http://twitter.com/
[5] Zarrella, D., “Modeling ReTweet Dynamics”, http://danzarrella.com/modeling-retweet-dynamics.html, 2009.
[6] Huberman, B. A. and S. A. Golder, “The Structure of Collaborative Tagging Systems”, HP Labs technical report, 2005.
[7] Tumblr, http://www.tumblr.com/.
[8] Wefollow, http://wefollow.com/, 2009.
[9] Schruz, F.D., “Glossary of Terms Used in Database Searching”, http://library.iusb.edu/instruction/helpguide/handouts/DatabaseSearching.shtml, 2009.
[10] Sun Microsystems, “Java Technology Reference”, http://java.sun.com/reference/index.jsp, 2009.
[11] IBM Research, “Eclipse IDE - an open extensible Integrated Development Environment (IDE)”, http://www.eclipse.org/, 2009.
[12] Yamamoto, Y., “Twitter4J - An open-sourced, mavenized and Google App Engine safe Java library for the Twitter API, released under the BSD license”, http://yusuke.homeip.net/twitter4j/en/index.html, 2009.
[13] Cattuto, C., D. Benz, A. Hotho and G. Stumme, “Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems”, 3rd Workshop on Ontology Learning and Population (OLP3), 2008.
[14] Wettler, M. and R. Rapp, “Computation of Word Associations Based on the Co-Occurrences of Words in Large Corpora”, International Conference On Computational Linguistics, Taipei, Taiwan, 2002.
[15] Au Yeung, C. M., M. Noll, N. Gibbins, C. Meinel and N. Shadbolt, “On Measuring Expertise in Collaborative Tagging Systems”, Proceedings of the WebSci'09: Society On-Line, Athens, Greece, 18-20 March 2009.
[16] Twitterholic, “Top Twitter User Rankings & Stats”, http://twitterholic.com, 2010.
[17] Google, “Google Search Engine”, http://www.google.com/, 2010.