Event Panning in a Stream of Big Data

Hugo Hromic, Marcel Karnstedt, Mengjiao Wang, Alice Hogan, Václav Belák, Conor Hayes
DERI, National University of Ireland Galway
{first.last}@deri.org

Abstract

In this paper, we present a hands-on experience report from designing and building an architecture for preprocessing and delivering real-time social-media messages in the context of a large international sporting event. In contrast to the standard topic-centred approach, we apply social community analytics to filter, segregate and rank an incoming stream of Twitter messages for display on a mobile device. The objective is to provide the user with a "breaking news" summary of the main sources, events and messages discovered. The architecture can be deployed in any context where (mobile) information consumers need to keep track of the latest news and trends, and the corresponding sources, in a stream of Big Data. We describe the complete infrastructure and the fresh stance we took for the analytics, the lessons we learned while developing it, and the main challenges and open issues we identified.

1 Introduction

Although large-scale data analytics as well as stream processing and mining are well-established research domains, they have recently received increased attention with the advent of the Big Data movement. While some of this attention might be due to hype, the characteristics of what is commonly understood as Big Data analytics pose a wide range of interesting and in parts novel challenges. When considering Big Data, people often focus on issues of data management and processing. However, there are just as many untackled challenges with respect to knowledge discovery and data mining, which lie at the heart of the actual analytics. This is particularly true in the context of Big Data streams, for which many of the usually considered approaches, such as the MapReduce paradigm, are not directly applicable.
In this paper, we report our experience from a "hands-on" exercise in discovering knowledge from one of the largest publicly available Big Data streams today, Twitter. We collected this experience in the context of creating a mobile application for the final stop of the Volvo Ocean Race 2012 in Galway, Ireland. For Galway, a city of ca. 75,000 citizens, this was a major event, with more than 900,000 visitors during one week. We asked ourselves the question: how can we leverage the experience of the visitors so that they get the most from the event and their time in Galway? Twitter was quickly identified as probably the best, fastest and most multifaceted source of information for this purpose. But what is needed to extract, analyse, prepare and present the most relevant information from this huge social-media stream for such a particular and – compared to the overall size of the Twittersphere – rather small and local event? This work summarises our architecture, spanning from the raw Big Data stream, via large-scale and streaming analytics, to the end-user mobile application. While we focus on the core analytics and the novel approach we took there, we briefly cover all main aspects of the whole framework. As such, we also touch on the areas of information retrieval, Web search, personalisation and ubiquitous access on mobile devices.

The widespread use of social media on internet-enabled mobile devices has created new challenges for research. As users report their experiences from around the world, information consumers are faced with an increasing number of sources to read in real time. Furthermore, the small screen size of mobile devices limits the amount of information that can be shown and interacted with. In short, social media users on mobile devices are faced with increasing amounts of information and less space to consume it.
Thus, finding, analysing and presenting information related to one particular event of high but still mostly local relevance is comparable to searching for a needle in a haystack. The intuitive approach that comes to mind is to mine the stream for particular topics of interest and try to pick those of high relevance. However, this is prone to missing particular information, as large dominating topics will often overshadow smaller ones. Moreover, it creates a mix of content that comes from many different and often unrelated groups of users, which might obscure particular facets of the groups' messages. For these reasons, we decided to take a fresh stance on this problem. Our approach is based on several observations in the literature [Kwak et al., 2010; Glasgow et al., 2012; Cha et al., 2010] that characterise Twitter as a novel mix between news media and social network. As such, we hypothesise that news and topics of particular interest will be amplified by groups of users that reveal close and frequent communication relations on Twitter. Thus, we argue that by mining for closely connected groups and analysing their topics, we actually search for the easier (but still challenging) to spot pin cushion rather than the single needle. To use another metaphor, we compare this task to panning for gold, and propose to: (i) pan only in the best part of the Big Data stream by applying a set of complex filters that benefit from knowledge discovery and machine learning, (ii) use network analysis as the screening step that identifies the truly valuable nuggets of information, and (iii) apply social ranking to bring the most valuable information to the fore. The contribution of the work in hand is a summary report from this hands-on experience.

Figure 1: Screens of the mobile app – (a) Main Screen, (b) Tweets Screen, (c) Group Screen
We discuss the major issues we encountered, the expected and unexpected challenges we had to face, and the data mining approaches we integrated as well as those that we identified as potentially beneficial for further improvement. We expect that this experience report can help others in designing and developing applications based on advanced Big Data stream analytics.

2 Application Overview

We developed our application as part of the official mobile application for the Volvo Ocean Race 2012 in Galway, Ireland, which was completely created by DERI, NUI Galway [1]. After more than three months of hard work, several major and minor changes to our approach and some sleepless nights, we were happy to finish Tweet Cliques [2], a small but smart app that shows what's up and hot around the event on Twitter. A recent blog post [3] nicely summarises a main issue of Big Data, whose truth we could confirm nearly every day: it's not just the analytics. All surrounding tasks, such as filtering, pre- and post-processing, are of equal importance. In this section, we give a complete overview of all the resulting crucial ingredients and the lessons we learned, before we focus on the actual analytical backbone in Section 3. Figure 1 shows our final product, the mobile application. On the main screen, we show the groups we found and the most recently observed tweets from the most central group members. On the second screen, we show the most recently observed tweet from each group member, ranked by the importance of the users. Finally, the third screen shows the users and the computed "Clique Score" (used for ranking, see Section 3.3) as well as the number of outgoing and incoming links of the users in the group.

2.1 Architecture

The main architecture is shown in Figure 2.
It comprises several interconnected components, from the live Twitter stream up to the final end-user mobile application.

[1] http://www.deri.ie/about/press/coverage/details/?uid=265&ref=214
[2] Information and download: http://uimr.deri.ie/tweet-cliques
[3] http://wp.sigmod.org/?p=430

The core idea is to process tweets that pass the filter continuously and to perform analytics for extracting community structures and topics in batch mode. However, incoming tweets are always classified to the corresponding groups and delivered in real time. The architecture is fully modular with a clear separation of the different processing logics involved. The system considers three kinds of data flows for processing:

• Streaming flow: data is processed continuously and endlessly, using a distributed publisher/subscriber model. The communication channel is always active and data processing is entirely event-driven. This flow is able to deliver events to multiple components at the same time.

• System-triggered flow: data is processed at regular intervals, controlled by known system timers (i.e., window slides). Data transport is only active when requested, and a fixed amount of data is passed between components in a point-to-point strategy.

• On-demand flow: very similar to the above flow, with the key difference that it is triggered by external events, such as users starting new sessions. This flow is also point-to-point, active upon request, and data flows in fixed amounts.

Two different databases provide the storage backbone for the application:

• Graph database: this store is used for real-time creation of relationship networks (mentions, retweets, replies and hashtag referrals). Time-stamped nodes and edges are inserted into the database on the fly after the corresponding entity extraction has been carried out.

• Relational database: this store keeps the relevant tweets' content and the ranked community structures (including users and labels).
Content is inserted as soon as tweets arrive, whereas community data is inserted at regular intervals (system-triggered) from the analytics component. We describe the two main components, the Stream Filter and the Analytics, in the following sections. The Tweets Classifier works on the data provided by the Analytics component and assigns tweets to groups. The communication between the Mobile App and the backend is based on a Web service, which allows the application to run on a wide range of different mobile platforms.

Figure 2: Architecture overview

2.2 Stream Filtering

The Twitter public stream APIs [4] provide real-time access to Twitter data by following specific users and topics. Due to these technical constraints, and due to concerns about scalability, we knew from the beginning that we had to filter out a particular part of the global Twitter stream. However, we quickly learned that filtering is not only crucial for these reasons. Rather, the trade-off between too much noise in the data (i.e., too relaxed filters) and sparsity of input (i.e., too strict filters) has a crucial impact on the results gained from the actual analytics. Consequently, the filtering component grew every day into a complex system of different techniques (see Figure 3), which still bears a lot of potential for improvement.
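To illustrate the kind of entity extraction that feeds the graph database with time-stamped, typed edges, a minimal sketch might look as follows. The function names and the reply/retweet heuristics are our illustrative assumptions (a real implementation would use the entity metadata Twitter attaches to each tweet rather than parsing the text):

```python
import re
from datetime import datetime, timezone

def extract_entities(tweet_text):
    """Extract the interaction entities that become graph edges: mentions,
    hashtags, a reply target and a retweet source (simplified text parsing)."""
    mentions = re.findall(r"@(\w+)", tweet_text)
    hashtags = re.findall(r"#(\w+)", tweet_text)
    # Heuristics: a tweet starting with "@user" is a reply; "RT @user" a retweet.
    reply_to = mentions[0] if tweet_text.startswith("@") else None
    m = re.match(r"RT @(\w+)", tweet_text)
    retweet_of = m.group(1) if m else None
    return {"mentions": mentions, "hashtags": hashtags,
            "reply_to": reply_to, "retweet_of": retweet_of}

def to_edges(author, tweet_text, ts=None):
    """Turn one tweet into time-stamped, typed edges for the graph store."""
    ts = ts or datetime.now(timezone.utc).isoformat()
    e = extract_entities(tweet_text)
    edges = [(author, u, "MENTION", ts) for u in e["mentions"]]
    if e["retweet_of"]:
        edges.append((author, e["retweet_of"], "RETWEET", ts))
    if e["reply_to"]:
        edges.append((author, e["reply_to"], "REPLY", ts))
    edges += [(author, h, "REFERS_TO", ts) for h in e["hashtags"]]
    return edges
```

Each returned tuple corresponds to one time-stamped edge insertion into the graph database.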
Figure 3: Overview of the filtering component

We make use of the two parameters that the stream APIs support to retrieve tweets from the public stream: 'follow', a list of user IDs, and 'track', a list of keywords.

Seeds

As a starting point, we manually chose a number of seed users and keywords that we identified as highly relevant to the Volvo Ocean Race and which we followed permanently. If we want to filter particularly for hashtags (commonly understood as topic labels on Twitter), we prefix a keyword with the '#' sign. This is important for some hashtags like '#vor', which would return a large number of unrelated tweets if used directly as a keyword. Further, one filter keyword can actually contain several words, which returns all tweets containing all of the words at any position in the text. In our current implementation, the addition of keywords, hashtags and users to the seed lists is mainly a manual and subjective process. Seed users were initially chosen by analysing Twitter lists relevant to the event. We picked users from these lists and their followers, i.e., a 2-hop user collection. Then we cleaned the resulting list manually. However, soon after having the first set of filters in place, we recognised that the overlap between the set of users actually using the seed hashtags and the set of users from the Twitter list approach is rather small, which might be due to a rather unpopular and ad-hoc usage of Twitter lists. Thus, we decided to manually check the descriptions and latest tweets of all seen users in order to determine whether they are relevant or mostly posting advertisements.
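The two streaming parameters and the multi-word 'track' semantics described above can be sketched as follows. This is a simplified client-side illustration under our own naming, not the actual filter implementation:

```python
def build_filter_params(seed_users, seed_keywords):
    """Assemble the two streaming-API parameters: 'follow' takes numeric
    user IDs, 'track' takes keywords, which may be '#'-prefixed hashtags
    or multi-word phrases."""
    follow = ",".join(str(uid) for uid in seed_users.values())
    track = ",".join(seed_keywords)
    return {"follow": follow, "track": track}

def matches_track(tweet_text, keyword):
    """Client-side check mirroring the 'track' semantics: every word of a
    (possibly multi-word) keyword must occur somewhere in the tweet text."""
    words = tweet_text.lower().split()
    return all(w.lower() in words for w in keyword.split())
```

For instance, the keyword "volvo ocean race" matches any tweet containing all three words at any position, while '#vor' restricts matching to the hashtag itself.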
Additional seed users are also found by manually checking the users that another seed user is following. This method was, for instance, used to find the list of Galway bars, as there was no corresponding Twitter list in place. Other social media, e.g., boards.ie [5], can also be used to find good seed users. The seed terms are found by looking at historical hashtags for the type of event, searching Twitter for any currently used hashtags, and also pre-emptively guessing seed keywords and hashtags that may be used in the future. Before a term is added to the seed list as a keyword or hashtag, its search results are manually checked in Twitter to determine the type of results that it is likely to generate. This helps to estimate the level of noise that the addition of the term will generate. In the following, we describe the two main building blocks of the filtering component: a set of filter inputs that define what we actually fetch from the stream (and filter out again, e.g., spam) and an extension logic aiming at a dynamic maintenance of these filter inputs.

[4] https://dev.twitter.com/docs/api/1/post/statuses/filter
[5] http://www.boards.ie/vbulletin/showthread.php?t=2056640325

Extension Logic

Hashtags in Twitter are very heterogeneous, as they may be freely chosen by users and can refer to particular topics in many different ways. As such, it is impossible to cover all related users and keywords in the a priori compiled lists. Therefore, we try to identify other relevant keywords and users by analysing the tweets we retrieve based on the currently active filters. Then, we extend the seed lists, but only for a certain amount of time, after which we re-evaluate the extension. In this way, extended keywords and users are included in the filters only for the time that we consider them relevant based on our extension logic. We extend periodically based on the following approaches:
1. Hashtag co-occurrence: We find hashtags that co-occurred with seed hashtags and extend by those that co-occur more frequently than a predefined threshold (in fact, a mix of an absolute and a relative threshold). We restrict this co-occurrence to the seed hashtags only in order to avoid a drift to unrelated topics.

2. Hashtag-user co-occurrence: We extend the user list by users who frequently use seed hashtags.

3. Central users: We follow users that are identified as central in their community by the analytics component (see Section 3.3). We do this in order to always retrieve the most recent tweets from these important users.

For example, tweets about a concert of the band "Thin Lizzy" may contain both #volvooceanrace and #thinlizzy, but also #tlizzy, #lizzy, etc. – it is impossible to cover all possibilities in advance. Moreover, these hashtags are relevant for our app only in a limited time window. This temporary importance is covered by the extension based on hashtag co-occurrence. This type of co-occurrence checking is related to detecting frequent itemsets in the stream of incoming tweets. We experimented with a corresponding implementation provided by the stream data management system AnduIN [Klan et al., 2011]. However, we found that hashtags actually occur more frequently alone or in pairs than in larger sets. We are nevertheless planning to integrate this stream-mining method more closely into our extension logic.

Trending Topics

The trending topic facility on Twitter caters for those interested in the hashtags and keywords occurring most frequently overall on Twitter. Users can query individual hashtags and get a listing of all posts quoting them. However, this does not solve the problem of filtering, ranking and selecting these posts for presentation on a mobile device. Nevertheless, we can benefit from the list of up-to-date trending topics for maintaining the seed lists, but also to feed the other filter inputs, as we describe next.
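The hashtag co-occurrence extension with its mix of an absolute and a relative threshold can be sketched as follows. The threshold values and function names are illustrative assumptions, not the ones used in the system:

```python
from collections import Counter

def cooccurring_hashtags(tweets_hashtags, seeds, min_abs=5, min_rel=0.05):
    """Find hashtags to extend by, based on co-occurrence with seed hashtags.

    `tweets_hashtags` is a list of hashtag sets, one per tweet. A candidate
    qualifies if it co-occurs with some seed hashtag at least `min_abs`
    times (absolute threshold) AND in at least a `min_rel` fraction of that
    seed's tweets (relative threshold)."""
    seed_count = Counter()   # how often each seed hashtag occurs
    co_count = Counter()     # (seed, candidate) co-occurrence counts
    for tags in tweets_hashtags:
        for s in tags & seeds:
            seed_count[s] += 1
            for t in tags - seeds:
                co_count[(s, t)] += 1
    extensions = set()
    for (s, t), c in co_count.items():
        if c >= min_abs and c / seed_count[s] >= min_rel:
            extensions.add(t)
    return extensions
```

In the system, the returned hashtags would be added to the filters only for the next time window and then re-evaluated.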
Blacklist

It is unavoidable to retrieve a lot of unrelated and in fact unwanted tweets from the public stream, such as porn-related tweets. In order to filter these out again, we built a blacklist containing words and phrases that we can safely correlate with such noise. Any tweet containing an entry from the blacklist is discarded. Further, we calculate a spam score S for each retrieved tweet, inspired by [Cha et al., 2010]. We consider the number of hashtags h and the number of trending topics t appearing in a tweet, and also the account age a of the user (the number of days since the account was created):

S = ((h + 1) · (t + 1)) / log(a + 1)

The intuition behind this calculation is that most spam messages try to attract attention by referring to many of the currently hot topics on Twitter at once. Furthermore, many spam accounts have a short lifetime, as they get blocked as soon as they are identified. This straightforward spam classification can obviously benefit from tailor-made machine learning techniques. An assessment of appropriate methods is planned for future work.

Greylist

The greylist is used for common terms that may be relevant, i.e., that we do not want to block in general, but which also bear the potential to cause a drift towards unrelated topics. Consider the concert example from above. In such a case it happened that we automatically extended our filters to listen to the hashtag #top5bands. But this is such a widely used hashtag on Twitter that the number of retrieved tweets, and consequently also the noise, surged. To filter out such unrelated tweets, we built a greylist containing hashtags which we call global hashtags: hashtags that bear the potential to create a drift from the "local" event to a "global" trend involving all or large parts of the overall Twittersphere.
Any tweet containing an entry from the greylist as well as one from the seed list (e.g., "#thinlizzy #top5bands #galway") will enter the system; if nothing from the seed list is included, it will be discarded. Our current approach finds such global hashtags by manually checking the hashtags correlated with trending topics on Twitter and other Twitter trend websites [6]. Further, the Twitter API can be used to query for hashtags representing currently trending topics. However, we refrained from adding these without any manual cross-check in order to avoid the aforementioned sparsity of input data. One obvious machine learning technique that will likely help to automate this process is stream-based burst detection. The intuition is that, due to their nature, following global hashtags will result in a burst of corresponding tweets. There are several applicable burst detection methods available, although we are not aware of any method particularly designed for Twitter. For future work, we plan to experiment with the method proposed in [Karnstedt et al., 2009], which is already available in the aforementioned stream system AnduIN [Klan et al., 2011]. However, for direct application, we will have to slightly modify the currently available implementation. For the time being, we based the greylist on a naive form of burst detection that makes use of the Twitter Search API. For each hashtag that we identify as potentially relevant, i.e., that we decide to extend by for the next window, we query the API for all related tweets from the last 24 hours. A hashtag that returns more tweets than a pre-defined threshold is regarded as a global hashtag. This approach bears the obvious problem of having to set yet another threshold. At the moment, the threshold is chosen based on historical data and regular monitoring is required. Further, neither does Twitter guarantee that all corresponding tweets are returned by the API call, nor are all global hashtags always "bursty".
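A condensed sketch of the three noise filters described above (blacklist with spam score, greylist admission, and the naive burst check) might look as follows. The threshold values, the natural-log base, and the handling of day-old accounts are our illustrative assumptions, not the values used in the system:

```python
import math

def spam_score(h, t, a):
    """S = (h + 1) * (t + 1) / log(a + 1): h hashtags and t trending topics
    in the tweet, account age a in days. Natural log assumed. Accounts
    created today (a = 0) would divide by zero; treating them as maximally
    suspicious is an assumption of this sketch."""
    if a <= 0:
        return float("inf")
    return (h + 1) * (t + 1) / math.log(a + 1)

def passes_blacklist(text, blacklist):
    """Discard any tweet containing a blacklisted word or phrase."""
    low = text.lower()
    return not any(phrase in low for phrase in blacklist)

def passes_greylist(hashtags, seeds, greylist):
    """A tweet carrying a greylisted ('global') hashtag enters the system
    only if it also carries at least one seed hashtag."""
    if hashtags & greylist:
        return bool(hashtags & seeds)
    return True

def is_global_hashtag(count_last_24h, threshold=2000):
    """Naive burst check via the Search API: more matching tweets in the
    last 24 hours than the threshold marks the hashtag as 'global'."""
    return count_last_24h > threshold
```

A tweet is forwarded to the analytics only if it passes all three checks and its spam score stays below a cut-off.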
Another idea is to use statistical process control (SPC) methods for calculating the threshold that indicates outliers or special causes – which are then likely candidates for global hashtags. This would require a fairly stable process to begin with. This and other statistical methods for burst detection are currently on our agenda.

[6] http://trendsmap.com/, http://whatthetrend.com/

3 The Analytical Backbone

The analytic goal of the application was to detect news and emerging topic trends by finding groups of actors and analysing the topics of these groups. Therefore, the analytic backbone comprises community identification, topical labelling of the communities, and finally multiple ranking strategies.

Figure 4: Communities with minimal size of 5 in different types of networks – (a) only retweet edges, (b) all types of edges

3.1 Community Analysis

In order to identify groups of actors, it is necessary to establish a notion of similarity or relatedness between the actors. An intuitive idea is to analyse the follow network of Twitter. But we are more interested in the groups that exist in situ. Thus, we need a more dynamic notion of relatedness. For this, different modes of interaction amongst the Twitter users yield three dynamic social networks, which can be understood as three different modes of communication happening on Twitter [Glasgow et al., 2012]:

• RETWEET: quoting other users
• MENTION: talking about other users
• REPLY: talking to other users

In all the networks, the nodes are the users and the directed weighted edges represent one of the following types of interactions. First, a user i can be connected to j if i has retweeted j. Analogously, i can be linked to j when i has mentioned j in her tweets. Finally, the REPLY network consists of users linked by the who-replied-to-whom relation. The weights represent the number of interactions we have observed, e.g., wij in the RETWEET network is the number of tweets of j retweeted by i.
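The construction of the three weighted interaction networks can be sketched as follows, with each network mapping an edge (i, j) to the observed interaction count w_ij (a minimal sketch; the actual system builds these incrementally in the graph database):

```python
from collections import defaultdict

def build_networks(interactions):
    """Build the RETWEET, MENTION and REPLY networks as weighted digraphs.

    `interactions` is an iterable of (source_user, target_user, kind)
    triples, e.g. ("alice", "bob", "RETWEET") when alice retweeted bob.
    Each network maps a directed edge (i, j) to the number of observed
    interactions of that kind, i.e. w_ij in the text."""
    nets = {"RETWEET": defaultdict(int),
            "MENTION": defaultdict(int),
            "REPLY": defaultdict(int)}
    for i, j, kind in interactions:
        nets[kind][(i, j)] += 1
    return nets
```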
Apart from the three social networks, Twitter users can also be associated with the hashtags they refer to in their tweets. This can be conveniently formalised as a bipartite network with n + h nodes, where n is the number of users and h the number of hashtags. This network contains m directed links, each connecting a user i to a hashtag j with weight wij if i has used the hashtag j wij times. Let us call this bipartite network REFERS TO. We use the hashtags to determine the topics of the tweets. In our use case, all the networks were generally very noisy and sparse (n ≈ m), which hindered the community detection step since many of the nodes were isolates. Further, while many community detection methods have been developed in recent years [Fortunato, 2010], only a few methods have been proposed to mine communities from multiplex (multi-relational) networks [Lin et al., 2009; Mucha et al., 2010]. These methods generally require a definition of the relations between the individual networks (e.g., how the act of retweeting relates to mentioning, etc.), which is generally not obvious. This fact, along with the sparsity and noisiness of the individual social networks, led us to the decision to initially employ a simple approach: the three social networks were merged into one by adding up the weights of the links from the RETWEET, MENTION and REPLY networks. The resulting network represents a general relation INTERACTS WITH. We further assume that the INTERACTS WITH relation correlates with the topical interests of the users. The merger, however, does not fully tackle the problem of noisiness, despite the fact that it amplifies the signal. Therefore, we employed the community detection method OSLOM [Lancichinetti et al., 2011], which identifies only statistically significant communities based on a randomised version of the graph. That way we effectively filtered out the noisy relations and isolates.
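The merger of the three elementary networks into the INTERACTS WITH network, by summing edge weights, is a one-liner in spirit and can be sketched as:

```python
from collections import defaultdict

def merge_networks(nets):
    """Merge the RETWEET, MENTION and REPLY networks into one
    INTERACTS_WITH network by adding up the weights of shared edges.
    `nets` maps a relation name to a dict of (i, j) -> weight."""
    merged = defaultdict(int)
    for edges in nets.values():
        for edge, w in edges.items():
            merged[edge] += w
    return dict(merged)
```

The merged network is then handed to the community detection step, which in the system is performed by OSLOM.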
As a result, only ≈ 7% of the nodes in INTERACTS WITH were found to be members of some significant community. This number was generally much lower for communities detected in one of the three elementary social networks. We illustrate this in Figure 4. As an example for the three elementary networks, we show the RETWEET network from a certain point in time in Figure 4a. Of the three discussed communication modes, the RETWEET relation is commonly the most frequently used one on Twitter [Kwak et al., 2010]. The figure clearly shows the many isolates and very small disconnected components at the periphery. The large hairball on the right is a community around the Olympics 2012 – which somehow managed to bypass our filters at that time. The second large component, which consists of a few smaller connected parts, is centred around the Volvo Ocean Race. In this network, OSLOM found only five communities. Figure 4b shows the merged network for the same time. It is about twice as large (4512 vs. 2860 nodes, 5802 vs. 2717 edges), but shows similar characteristics. The Olympics hairball and the Volvo Ocean Race cluster are still obvious. The network now reveals connections between both (which might indicate how the Olympics topic ended up in our data) and the appearance of a third larger component. Interestingly, the component on the bottom left is centred around Galway users tweeting about the race, while the new one above it is centred around the accounts of the participating race teams – and both are connected by a small group around the official Twitter account of the race. In the merged network, OSLOM identified 25 communities. One might think that this is actually a small number given the clustered appearance of the network. However, often many users refer to a few very central ones (like the official Olympics account), but interact rather sparsely with each other. We believe that this is not an indication of the kind of "news spreading community" that we are actually looking for.
In fact, we found that most of the 25 communities are as expected and intended, such as one around DERI and NUI Galway spreading the news about the Volvo Ocean Race app, communities around the teams giving race updates, etc. However, we consider applying different community detection algorithms to assess the impact of finding more (and thus noisier) communities. Another feature of OSLOM is that it yields an overlapping community structure, i.e., a node can be a member of multiple communities. This is an important factor, because the communities are assumed to be centred around shared topics, and indeed a user may be interested in multiple distinct topics, which results in her membership in multiple communities. But, to our surprise, this case occurred very seldom. Twitter communities seem to be rather disjoint, although often well connected. In order to reveal changes in the community structure, we segmented the streamed INTERACTS WITH network by a sliding time window 36 hours wide and overlapping by 35 hours, i.e., the window was moved by one hour each time. We observed that any narrower window gives too little signal. Further, as the majority of relevant Twitter users were likely to be geographically collocated near the location of the event, there was a regular gap in their activity during the night hours. Therefore, the 36-hour window effectively represents approximately 24 hours of user activity. An independent set of communities Ct = {C1, . . . , Ck} was then obtained by running OSLOM on each network time slice t. In order to maintain an intuitive user experience in the client application, it was necessary to track the same communities from one time slice to the next. We adopted the approach proposed in [Greene et al., 2010], which formulates the process of tracking as a problem of maximal bipartite matching between the sets of clusters from time slices ti and ti+1.
All communities from both slices are represented as nodes in a bipartite graph, whose weighted links represent the similarities between each possible pair of clusters from the two subsequent time slices. We used the Jaccard coefficient to compute this similarity sC:

sC = |C1 ∩ C2| / |C1 ∪ C2|,  C1 ∈ Cti, C2 ∈ Cti+1

Finally, we did not match clusters whose similarity was below a threshold θc = 0.2.

3.2 Topic Labelling

After the community structure was obtained, it was necessary to label each community by the topic it was centred around. Recall that a weight wij in the REFERS TO network represents the number of times a user i has used a hashtag j. The joint frequency fjC of a hashtag j in a community C = {u1, . . . , ul} with l users can therefore be easily obtained as fjC = Σi∈C wij. Note that even though all the tags of the community members are associated with the community, we expect the modes of the distribution of the community's hashtags to be located at the hashtags representing similar and coherent topics. In order to select only the relevant themes for the community and filter out noise, we used a frequency threshold θf and labelled a community C only by topics with fjC ≥ θf. We observed that choosing θf = 2, i.e., filtering out just the hashtags that were used only once by a member of a community, already significantly improved the quality of the labels. This also confirmed our assumption about the topical coherence of the mined communities. While hashtag usage is a natural and intuitive approach for labelling communities, many users do not use hashtags in their tweets. Hence, we believe the labelling process can be improved by more advanced text processing techniques like keyword/key-phrase extraction, named entity recognition, etc. We discuss this further in Section 4.

3.3 Ranking is Vital

As mentioned in Section 2.2, there is an unavoidable trade-off between having too much noise in the system and not having enough interesting and relevant information.
As such, we cannot avoid finding communities and topics that are not perfectly (or better, obviously) related to the event. The only way to overcome this and still provide a satisfying user experience is an appropriate ranking. With a good ranking, the crucial and relevant information will always be the first the user sees – and rather unrelated information is shown at the bottom of the groups list. Thus, after the communities were identified and labelled, multiple ranking strategies were used. First, as we have already described, the importance of a topic (hashtag) j within a community C is measured by its joint frequency fjC. Second, the communities were ranked according to their correspondence with the global focus of the application. Third, the members of each community were ranked in order to present the most relevant tweets in each community. The global focus of the application is determined by the set of p seed hashtags S = {h1, . . . , hp}, which are used to obtain the tweets relevant to the event (see Section 2.2). Our first approach was to measure that correspondence by the Jaccard coefficient between the seed set and the set of community hashtags. This works well, but can have a drawback in certain situations: if a community has many hashtags assigned, which cover a significant fraction of the seed hashtags, its large hashtag set inflates the denominator of the Jaccard coefficient, so it can still be ranked lower than a community with only a few hashtags that cover only a small part of the seed hashtags. Thus, in our current version, we remove that dependence on the number of hashtags in a community and use only the number of overlapping hashtags:

rC = |S ∩ H|, H = {j : fjC ≥ θf}

We are still analysing which of the two rankings is better suited to which situations. As a matter of fact, both provide a significant improvement of the results forwarded to the mobile client. If two communities have the same rank rC1 = rC2, then the one with more hashtags is ranked higher.
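The labelling and community-ranking steps above, including the tie-breaks by label count and (as described next in the text) by community size, can be sketched as follows. Function names and data layout are illustrative assumptions:

```python
def joint_frequencies(community, refers_to):
    """Joint frequency f_j^C of hashtag j in community C: the summed usage
    counts of j over the community's members (weights of the REFERS_TO
    bipartite network, given here as user -> {hashtag: count})."""
    freqs = {}
    for user in community:
        for tag, w in refers_to.get(user, {}).items():
            freqs[tag] = freqs.get(tag, 0) + w
    return freqs

def rank_communities(communities, refers_to, seeds, theta_f=2):
    """Rank communities by r_C = |S ∩ H| with H = {j : f_j^C >= theta_f},
    breaking ties first by the number of labels (more specific first) and
    then by community size."""
    def key(c):
        labels = {j for j, f in joint_frequencies(c, refers_to).items()
                  if f >= theta_f}
        return (len(labels & seeds), len(labels), len(c))
    return sorted(communities, key=key, reverse=True)
```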
The intuition behind this is that a community with more hashtags is more specific and should thus be listed first. If the number of hashtags is also equal, then they are finally ranked by their size, i.e., the bigger communities are given priority. In order to rank the users within a community, we first induced a subgraph of the INTERACTS WITH network by including only the members of the community and the links between them. The nodes were then ranked by PageRank. We used PageRank in particular because it has been frequently used as a heuristic for capturing the influence and authority of actors in social networks [Sun and Tang, 2011]. Finally, tweets in the mobile application are ordered by decreasing PageRank value of their authors. For future versions we plan to give the user the option to switch to a purely time-based ordering, as provided by standard Twitter applications. We further consider experimenting with other social ranking mechanisms, such as taking the actual user, their preferences and personal networks into account [Gou et al., 2010; Haynes and Perisic, 2009]. One promising approach is to apply a tailored notion of PageRank such as the FolkRank proposed in [Hotho et al., 2006]. Our hypothesis is that the most central users in the observed networks are the most important ones, and the ones that most frequently initiate the most important news. As such, we also decided to follow these users until the next community analysis is performed (currently in 1-hour intervals). The top 33% or top-3 users (whichever is fewer) are therefore signalled to the filtering component, which takes care of the appropriate extension (see Section 2.2). Further, the mobile application shows the last received tweet from the two highest-ranked users of each community on its main screen (see Figure 1).
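A minimal sketch of the two ranking steps just described. The data structures are assumptions for illustration (community dictionaries with `hashtags` and `members` keys, an adjacency mapping for the induced INTERACTS WITH subgraph), and the PageRank is a bare power iteration rather than the implementation we used.

```python
def rank_communities(communities, seeds):
    """Order communities by (1) seed overlap r_C = |S ∩ H|,
    (2) number of hashtags (more specific first), (3) community size."""
    def key(c):
        hashtags = set(c["hashtags"])
        return (len(seeds & hashtags), len(hashtags), len(c["members"]))
    return sorted(communities, key=key, reverse=True)

def pagerank(adj, d=0.85, iters=50):
    """Minimal PageRank by power iteration; `adj` maps each node of the
    induced community subgraph to an iterable of its out-neighbours."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            out = [v for v in adj[u] if v in rank]
            if out:
                share = d * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:  # dangling node: distribute its rank uniformly
                for v in nodes:
                    nxt[v] += d * rank[u] / n
        rank = nxt
    return rank
```

Sorting members by decreasing PageRank then directly yields the tweet ordering shown in the mobile client.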
It is thus of importance that these tweets are indeed the most recent ones from these users.

Figure 5: “Topic latency”: timely detection, late removal

4 Lessons Learned and Future Work

Our general conclusion is positive, as we can say “It works!”. We continuously monitored the application and the determined groups and topics before and during the Volvo Ocean Race in Galway. Indeed, the application displayed mostly relevant groups with interesting and relevant topics. These ranged from small to large topics and clearly separated certain groups of interest, such as those around the race teams or certain events – which, as a byproduct, also separated groups of particular languages. Still, as we discussed before, it is unavoidable to get certain topics into the application that are not obviously related to the event. To our relief, the applied ranking method always positioned them at the lower end of the groups list. Still, we plan to run experiments with different, less restrictive community detection methods to assess the quality of the currently identified groups – which are often rather small.

There are many important lessons we have learned and a wide range of identified challenges and research topics for future work. They arose from the mere focus on developing a real-time analytics application with a user-friendly interface, but also from issues we observed regarding the applied concepts and general design choices. One of the most crucial challenges lies in the actual analytics. Our batch-based approach can only work because we handle a rather limited part of the whole Twitter stream (due to the filtering). This might not always be the case, or appropriate. Moreover, any method based on a priori defined fixed windows is usually prone to providing sub-optimal results. As such, we plan to intensely investigate methods for real-time community detection and analysis. Our main focus will be on how to achieve a community monitoring that first supports the identification of change points, which signal the need for re-running the network clustering. Later, we will assess opportunities to do the whole community detection itself in a streaming fashion.

The general idea of moving from a topic-centred approach to a group-centred one works out. With first tests based on pure topic clustering in the analysed time slices, we found that, as expected, large topics “swallow” smaller ones – in our approach the listed groups and topics range from small to large. Nevertheless, we experienced a sort of “topic latency” problem, as shown in Figures 5 and 6 for two events during the race. Figure 5 shows the number of tweets over time for a very popular band and indicates the time span during which we observed a corresponding topic in our app. As one can see, a large number of tweets results in an accurate early detection – but also in a belated removal of the topic. In contrast, an event with a rather low number of tweets (Figure 6) can result in a belated topic display but a timely removal. We plan to tackle this issue as follows:

Figure 6: “Topic latency”: late detection, timely removal

1. Decouple the topic detection from the actual group detection. While we have to use a larger window for finding a community signal in the networks, this is not necessary for the topic labelling of a community.
We plan to mine these topic labels in shorter intervals for the found communities, which will also reflect topic changes in groups more accurately.

2. Integrate a more sophisticated method for trend and event detection. This should be combined with a method for detecting bursts, which we already plan to incorporate into the filtering component (see Section 2.2). In order to do this in (near) real time, we are considering adopting more stream-mining approaches and will investigate cloud-based computing frameworks [Vakali et al., 2012].

3. Include NLP techniques to extend the topic labelling from considering only hashtags to general keywords. First tests in this direction were promising. This will, however, require a higher weighting of hashtags, as they are generally understood as topic labels by the Twitter community.

During the development of the app, we were surprised to see how crucial the quality of the filtering component actually is. Similar to the above, we identified a strong need for advanced burst and trend detection methods, as well as other suitable knowledge discovery and machine learning techniques. In order to make the application run really smoothly, we have to achieve an almost 100% automated maintenance of the filters. In fact, feeding the filters is currently the one component that requires the most manual input and benefits the least from automated analyses. Nevertheless, this manual and laborious process helped us to clearly identify the main challenges and problems we will face when aiming for an automated approach:

• Our results are very reliant on Twitter users’ behaviour, e.g., retweeting, usage of Twitter lists, accurate user descriptions and the usage/non-usage of hashtags. This makes it very challenging to remove the manual element from finding suitable users and terms. Which data mining techniques are suited to help here?
Can they be applied out of the box or do they have to be adapted to the particular use case?

• The decision-making for including/excluding users and terms is very subjective. For instance, should we include all users on a relevant Twitter list or should we filter them manually? Should we exclude users who mostly advertise products? Maybe some of these products are interesting to the users of the app. Should we put all global hashtags on the greylist, or all topics trending on Twitter? This might be problematic when considering larger events, such as the Olympics, which bear a global character in themselves.

The same holds for the importance of post-processing, i.e., what to show in the mobile client and how. Should we display retweeted messages just because they are the most recent, or rather only the original tweets? What minimal group size actually makes sense? Should tweets appear in multiple groups if the author belongs to more than one? These questions are also related to the crucial aspect of appropriate ranking techniques. This primarily concerns the ranking of groups and thus topics, but it also refers to the ranking of users, tweets and the keywords used to label topics. Along the same lines, we believe that the application provides perfect ground for research on personalisation. We envision including personal preferences, personal social networks, users’ own tweets and topics, etc. This will strongly benefit from the aforementioned social ranking techniques, but also from research in the area of user interfaces and HCI.

Last but not least, the probably expected statement about the many unexpected issues: never underestimate Murphy’s law! Although the app was running smoothly for days before the actual event, at the time of most importance we ran into issues with temporary outages of our Twitter connection, hardware failures, etc. This can partly be mitigated by unit tests, much more intense than the norm for any research prototype.
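Returning to the burst detection we plan to integrate: a naive moving-average threshold can serve as a baseline sketch. This is far simpler than the adaptive detection of [Karnstedt et al., 2009]; the window size and burst factor below are arbitrary illustrative values.

```python
from collections import deque

def detect_bursts(counts, window=24, k=3.0):
    """Flag time slots whose tweet count exceeds the moving average of the
    preceding `window` slots by a factor of k (illustrative thresholds)."""
    history = deque(maxlen=window)
    bursts = []
    for t, c in enumerate(counts):
        if history:
            avg = sum(history) / len(history)
            if c > k * max(avg, 1.0):
                bursts.append(t)
        history.append(c)
    return bursts
```

A fixed factor k over a fixed window inherits exactly the windowing problems discussed above, which is why an adaptive, stream-based method remains future work.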
Similarly, one tends to underestimate the effort required to build such a complete architecture and the wide range of issues, from tiny to crucial, one has to face during the design and development. As researchers, we are not primarily used to creating working applications within a given time – for us this was a very good lesson about all the glitches and itches around that.

Acknowledgements

This work was jointly supported by the European Union (EU) under grant no. 257859 (ROBUST integrating project) and Science Foundation Ireland (SFI) under grant no. SFI/08/CE/I1380 (LION-2).

References

[Cha et al., 2010] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In ICWSM, 2010.
[Fortunato, 2010] Santo Fortunato. Community detection in graphs. Physics Reports, 486:75–174, 2010.
[Glasgow et al., 2012] Kimberly Glasgow, Alison Ebaugh, and Clayton Fink. #londonsburning: Integrating geographic, topical and social information during crisis. In ICWSM, 2012.
[Gou et al., 2010] Liang Gou, Xiaolong (Luke) Zhang, Hung-Hsuan Chen, Jung-Hyun Kim, and C. Lee Giles. Social network document ranking. In JCDL, pages 313–322, 2010.
[Greene et al., 2010] D. Greene, D. Doyle, and P. Cunningham. Tracking the evolution of communities in dynamic social networks. In ASONAM, pages 176–183, 2010.
[Haynes and Perisic, 2009] Jonathan Haynes and Igor Perisic. Mapping search relevance to social networks. In Workshop on Social Network Mining and Analysis (SNA-KDD), pages 2:1–2:7, 2009.
[Hotho et al., 2006] Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. Information retrieval in folksonomies: search and ranking. In ESWC, pages 411–426, 2006.
[Karnstedt et al., 2009] M. Karnstedt, D. Klan, Chr. Politz, K.-U. Sattler, and C. Franke. Adaptive burst detection in a stream engine. In SAC, pages 1511–1515, 2009.
[Klan et al., 2011] D. Klan, M. Karnstedt, K. Hose, L. Ribe-Baumann, and K. Sattler.
Stream Engines Meet Wireless Sensor Networks: Cost-Based Planning and Processing of Complex Queries in AnduIN. Distributed and Parallel Databases, 29(1):151–183, January 2011.
[Kwak et al., 2010] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In WWW, pages 591–600, 2010.
[Lancichinetti et al., 2011] A. Lancichinetti, F. Radicchi, J. J. Ramasco, and S. Fortunato. Finding statistically significant communities in networks. PLoS ONE, 6(4):e18961, 2011.
[Lin et al., 2009] Y. R. Lin, J. Sun, P. Castro, R. Konuru, H. Sundaram, and A. Kelliher. MetaFac: community discovery via relational hypergraph factorization. In SIGKDD, pages 527–536, 2009.
[Mucha et al., 2010] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J. P. Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876, 2010.
[Sun and Tang, 2011] J. Sun and J. Tang. A survey of models and algorithms for social influence analysis. In Social Network Data Analytics, pages 177–214. Springer, 2011.
[Vakali et al., 2012] Athena Vakali, Maria Giatsoglou, and Stefanos Antaris. Social networking trends and dynamics detection via a cloud-based framework design. In Workshop on Mining Social Network Dynamics (MSND – WWW Companion), pages 1213–1220, 2012.