Event Panning in a Stream of Big Data

Hugo Hromic, Marcel Karnstedt, Mengjiao Wang, Alice Hogan, Václav Belák, Conor Hayes
DERI, National University of Ireland Galway
{first.last}@deri.org
Abstract
In this paper, we present a hands-on experience report from designing and building an architecture for preprocessing & delivering real-time
social-media messages in the context of a large
international sporting event. In contrast to the
standard topic-centred approach, we apply social
community analytics to filter, segregate and rank
an incoming stream of Twitter messages for display on a mobile device. The objective is to provide the user with a “breaking news” summary
of the main sources, events and messages discovered. The architecture can be generally deployed
in any context where (mobile) information consumers need to keep track of the latest news &
trends and the corresponding sources in a stream
of Big Data. We describe the complete infrastructure and the fresh stance we took for the analytics, the lessons we learned while developing
it, as well as the main challenges and open issues
we identified.
1 Introduction
Although large-scale data analytics as well as stream processing and mining are well-established research domains, they have recently received increased attention with the advent of the Big Data movement. While some of this attention might be due to hype alone, the characteristics of what
is commonly understood as Big Data analytics pose a wide
range of interesting and in parts novel challenges. People
often focus on issues of data management and processing
when considering Big Data. However, there are as many
untackled challenges with respect to knowledge discovery
and data mining at the heart of the actual analytics. This
is particularly true in the context of Big Data streams, for
which many of the usually considered approaches, such as
the MapReduce paradigm, are not directly adoptable.
In this paper, we report on our own experience from a
“hands-on” exercise on the quest of discovering knowledge
from one of the largest Big Data streams publicly available these days, Twitter. We collected these experiences
in the context of creating a mobile application for the final stop of the Volvo Ocean Race 2012 in Galway, Ireland.
For Galway, a city of ca. 75,000 citizens, this was a major event with more than 900,000 visitors during one week.
We asked ourselves the question: How can we leverage the
experience of the visitors so that they get the most from the
event and their time in Galway? Twitter was quickly identified as probably the best, fastest and most multifaceted
source of information for this purpose. But what is needed
to extract, analyse, prepare and present the most relevant
information from this huge social-media stream for such a
particular and – compared to the overall size of the Twittersphere – rather small and local event? This work summarises our architecture spanning from a raw Big Data
stream over large-scale and streaming analytics to the enduser mobile application. While we focus on the core analytics and the therein novel approach we took, we briefly
cover all main aspects of the whole framework. As such,
we also touch areas of information retrieval, Web search,
personalisation and ubiquitous access on mobile devices.
The widespread use of social media on internet-enabled
mobile devices has created new challenges for research. As
users report their experiences from the world, information
consumers are faced with an increasing number of sources
to read in real time. Furthermore, the small screen size of
mobile devices limits the amount of information that can
be shown and interacted with. In short, social media users
on mobile devices are faced with increasing amounts of
information and less space to consume it. Thus, finding, analysing and presenting information related to one particular event of high but still mostly local relevance is comparable to searching for a needle in a haystack.
The intuitive approach that comes to mind is to mine the
stream for particular topics of interest, and try to pick those
that are of high relevance. However, this is prone to miss
particular information, as large dominating topics will often overshadow the rather small ones. Moreover, it creates
a mix of content that comes from many different and often
unrelated groups of users, which might obscure particular
facets of the groups’ messages.
For these reasons, we decided to take a fresh stance on
this problem. Our approach is based on several observations in literature [Kwak et al., 2010; Glasgow et al., 2012;
Cha et al., 2010] that characterise Twitter as a novel mix
between news media and social network. As such, we hypothesise that news and topics of particular interest will be
amplified by groups of users that reveal close and frequent
communication relations on Twitter. Thus, we argue that by mining for closely connected groups and analysing their topics, we actually search for the pin cushion, which is easier (but still challenging) to spot, rather than for a single needle. To use another metaphor, we compare this task to panning for gold, and propose to:
(i) pan only in the best part of the Big Data stream by applying a set of complex filters that benefit from knowledge discovery and machine learning,
(ii) use network analysis to achieve successful screening
to identify the truly valuable nuggets of information,
(iii) apply social ranking to bring the most valuable information to the highest attention.

Figure 1: Screens of the mobile app. (a) Main Screen; (b) Tweets Screen; (c) Group Screen.
The contribution of the work in hand is a summary report from this hands-on experience. We discuss the major
issues we encountered, the expected and unexpected challenges we had to face and the data mining approaches we
integrated as well as those that we identified as potentially
beneficial for further improvement. We expect that this experience report can help others in designing & developing
applications based on advanced Big Data stream analytics.
2 Application Overview
We developed our application as part of the official mobile application for the Volvo Ocean Race 2012 in Galway,
Ireland, which was completely created by DERI, NUI Galway.1 After more than three months of hard work, several
major and minor changes to our approach and some sleepless nights, we were happy to finish Tweet Cliques2 , a small
but smart app that shows what’s up and hot around the event
on Twitter. A recent blog post3 nicely summarises a main issue of Big Data, a truth we could confirm nearly every day: it’s not just the analytics. All surrounding tasks,
such as filtering, pre- and post-processing, are of equal importance. In this section, we give a complete overview
of all the resulting crucial ingredients and the lessons we
learned, before we focus on the actual analytical backbone
in Section 3.
Figure 1 shows our final product, the mobile application.
On the main screen, we show the groups we found and the
most recently observed tweets from the most central group
members. On the second screen, we show the most recently
observed tweet from each group member, ranked by the importance of the users. Finally, the third screen shows the
users and the computed “Clique Score” (used for ranking,
see Section 3.3) as well as the number of outgoing and incoming links of the users in the group.
2.1 Architecture
The main architecture is shown in Figure 2. It comprises
several interconnected components, from the live Twitter
1 http://www.deri.ie/about/press/coverage/details/?uid=265&ref=214
2 Information and download: http://uimr.deri.ie/tweet-cliques
3 http://wp.sigmod.org/?p=430
stream up to the final end-user mobile application. The core
idea is to process tweets that pass the filter continuously
and perform analytics to extract community structures and
topics in a batch mode. However, incoming tweets are always assigned to the corresponding groups and delivered in
real time. The architecture is fully modular with a clear
separation of the different processing logics involved.
The system considers three kinds of data flows for processing:
• Streaming flow: data is processed continuously and
endlessly, using a publisher/subscriber distributed
model. The communication channel is always active
and data processing is totally event-driven. This flow
is able to deliver events to multiple components at the
same time.
• System-triggered flow: data is processed at regular intervals, controlled by known system timers (i.e., window slides). Data transport is only active when requested, and a fixed amount of data is passed between components in a point-to-point strategy.
• On-demand flow: very similar to the above flow, with
the key difference that this is triggered by external
events, such as users starting new sessions. This flow
is also point-to-point, active upon requests and data
flows in fixed amounts.
Two different databases provide the storage backbone for
the application:
• Graph database: This store is used for real-time creation of relationship networks (mentions, retweets, replies and hashtag referrals). Time-stamped nodes and edges are inserted into the database on the fly after a corresponding entity-extraction step.
• Relational database: This storage keeps the relevant
tweets’ content and the ranked community structures
(including users and labels). Content is inserted as
soon as tweets arrive, whereas community data is inserted at regular intervals (system triggered) from the
analytics component.
We describe the two main components, the Stream Filter and the Analytics, in the following sections. The Tweets Classifier works on the data provided by the Analytics component and assigns tweets to groups. The communication between the Mobile App and the backend is based on a Web service, which allows the application to run on a wide range of different mobile platforms.

Figure 2: Architecture overview
2.2 Stream Filtering
The Twitter public stream APIs4 provide real-time access to Twitter data by following specific users and topics. For these technical reasons alone, and because of concerns about scalability, we knew from the beginning that we had to extract a particular part of the global Twitter stream. However,
we quickly learned that filtering is not only crucial because
of these reasons. Rather, the trade-off between too much
noise in the data (i.e., too relaxed filters) and sparsity of input (i.e., too strict filters) has a crucial impact on the results
gained from the actual analytics. Consequently, the filtering component grew with every day into a complex system
of different techniques (see Figure 3), which still bears a
lot of potential for improvement.
Figure 3: Overview of the filtering component
We make use of two parameters that the stream APIs support to retrieve tweets from the public stream:
1. Follow: a list of user IDs
2. Track: a list of keywords
Seeds As a starting point, we manually chose a number of seed users and keywords that we identified as highly relevant to the Volvo Ocean Race and which we followed permanently. If we want to filter particularly for hashtags (commonly understood as topic labels on Twitter), we precede a keyword by the ‘#’ sign. This is important for some hashtags like ‘#vor’, which would return a large amount of unrelated tweets if used directly as a keyword. Further, one filter keyword can actually contain several words, which returns all tweets containing each of the words at any position in the text.
In our current implementation, the addition of keywords, hashtags and users to the seed lists is mainly a manual and subjective process. Seed users were initially chosen by analysing Twitter lists relevant to the event. We picked users from these lists and their followers, i.e., a 2-hop user collection. Then we cleaned the resulting list manually. However, soon after having the first set of filters in place, we recognised that the overlap between the set of users actually using the seed hashtags and the set of users from the Twitter list approach is rather small, which might be due to a rather unpopular and ad-hoc usage of Twitter lists. Thus, we decided to manually check the descriptions and latest tweets of all seen users in order to determine whether they are relevant or whether they are mostly doing advertisement. Additional seed users are also found by manually checking the users that another seed user is following. This method was, for instance, used to find the list of Galway bars, as there was no corresponding Twitter list in place. Other social media, e.g., boards.ie5, can also be used to find good seed users.
The seed terms are found by looking at historical hashtags for the type of event, searching Twitter for any currently used hashtags and also pre-emptively guessing seed keywords and seed hashtags that may be used in the future. Before a term is added to the seed list as a keyword or hashtag, its search results are manually checked in Twitter to determine the type of results that it is likely to generate. This helps to estimate the level of noise that the addition of the term will generate.
In the following, we describe the two main building
blocks of the filtering component: a set of filter inputs that
define what we actually fetch from the stream (and filter
out again, e.g., spam) and an extension logic aiming at a
dynamic maintenance of these filter inputs.
4 https://dev.twitter.com/docs/api/1/post/statuses/filter
Extension Logic Hashtags in Twitter are very heterogeneous, as they may be freely chosen by users and can refer
to particular topics in many different ways. As such, it is
impossible to cover all related users and keywords in the
a priori created lists. Therefore, we try to identify other
relevant keywords and users by analysing the tweets we retrieve based on the currently active filters. Then, we extend
5 http://www.boards.ie/vbulletin/showthread.php?t=2056640325
the seed lists, but only for a certain amount of time. After
that time we re-evaluate the extension. In this way, extended keywords and users are included in the filters only for the time that we consider them relevant based on our extension logic. We extend periodically based on the following
approaches:
1. Hashtag co-occurrence: We find hashtags that cooccurred with seed hashtags and extend by those that
co-occur more frequently than a predefined threshold
(in fact, a mix of an absolute and a relative threshold). We restrict this co-occurrence to the seed hashtags only in order to avoid a drift to unrelated topics.
2. Hashtag-user co-occurrence: We extend the user list
by users who frequently use seed hashtags.
3. We follow users that are identified as central in their
community by the analytics component (see Section 3.3). We do this in order to always retrieve the
most recent tweets from these important users.
For example, tweets about a concert of the band “Thin Lizzy” may contain both #volvooceanrace and #thinlizzy, but also #tlizzy, #lizzy, etc. – it is impossible to cover all possibilities. Moreover, these hashtags are relevant for our app only in a limited time window. This temporary importance is covered by the extension based on hashtag co-occurrence.
This type of co-occurrence checking is related to detecting frequent itemsets in the stream of incoming tweets. We
experimented with a corresponding implementation provided
by the stream data management system AnduIN [Klan et
al., 2011]. However, we found that hashtags are actually
occurring more frequently alone or in pairs than in larger
sets. We are nevertheless planning to integrate this stream-mining method more closely into our extension logic.
Trending Topics The trending topic facility on Twitter
caters for those interested in the most frequently occurring
hashtags and keywords overall in Twitter. Users can query
individual hashtags and get a listing of all quoting posts.
However, this does not solve the problem of filtering and
ranking and selecting these posts for presentation on a mobile device. Nevertheless, we can benefit from the list of
up-to-date trending topics for maintaining the seed lists, but
also to feed the other filter inputs, as we describe next.
Blacklist It is unavoidable to retrieve a lot of unrelated
and in fact unwanted tweets from the public stream, such
as porn-related tweets. In order to filter these out again, we
built a blacklist containing words and phrases that we can
safely correlate with such noise. Any tweet containing an
entry from the blacklist is discarded. Further, we calculate
a spam score S for each retrieved tweet, inspired by [Cha
et al., 2010]. We consider the number of hashtags h and
the number of trending topics t appearing in a tweet and
also the account age a of the user (number of days since
the account was created).
S = (h + 1) · (t + 1) / log(a + 1)

The intuition behind this calculation is that most spam messages try to attract attention by referring to many of the currently hot topics on Twitter at once. Furthermore, many spam accounts are short-lived, as they get blocked as soon as they are identified. This straightforward spam
classification can obviously benefit from tailor-made machine learning techniques. An assessment of appropriate
methods is planned for future work.
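A minimal sketch of this spam score; the cut-off value and the treatment of day-old accounts are our own assumptions, as the paper fixes neither:

```python
import math

def spam_score(num_hashtags, num_trending, account_age_days):
    """S = (h + 1) * (t + 1) / log(a + 1); higher values suggest spam.
    Many hot hashtags in one tweet and a very young account both raise
    the score; the log makes extra account age matter less and less."""
    denom = math.log(account_age_days + 1)
    if denom == 0:  # account created today: maximally suspicious (assumption)
        return float("inf")
    return (num_hashtags + 1) * (num_trending + 1) / denom

# Hypothetical cut-off; the paper does not state the threshold used.
SPAM_THRESHOLD = 5.0

def looks_like_spam(num_hashtags, num_trending, account_age_days):
    return spam_score(num_hashtags, num_trending, account_age_days) > SPAM_THRESHOLD
```

For instance, a year-old account tweeting without hashtags scores far below a day-old account stuffing five hashtags and three trending topics into one tweet.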
Greylist The greylist is used for common terms that may
be relevant, i.e., that we do not want to block in general, but
which also bear the potential to result in a drift towards unrelated topics. Consider the concert example from above. It
happened that in such a case we automatically extended our
filters to listen to the hashtag #top5bands. However, this is such a widely used hashtag on Twitter that the number of retrieved tweets, and consequently also the noise, exploded. To filter out such unrelated tweets, we built a greylist containing hashtags which we call global hashtags: hashtags
that bear the potential to create a drift from the “local” event
to a “global” trend involving all or large parts of the overall Twittersphere. Any tweet containing an entry from the
greylist and from the seed list (i.e., “#thinlizzy #top5bands
#galway”) will enter the system – if nothing from the seed
list is included, it will be discarded.
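The combined blacklist/greylist decision can be sketched as follows; the list contents here are illustrative placeholders, not the lists used in production:

```python
BLACKLIST = {"viagra", "xxx"}           # example noise phrases
GREYLIST = {"#top5bands"}               # example global hashtags
SEEDS = {"#volvooceanrace", "#galway"}  # example seed hashtags

def accept(tweet_text, hashtags):
    """Filter-core sketch: drop blacklisted tweets outright; tweets that
    carry a greylisted (global) hashtag survive only if a seed hashtag
    is also present."""
    text = tweet_text.lower()
    if any(term in text for term in BLACKLIST):
        return False
    tags = {t.lower() for t in hashtags}
    if tags & GREYLIST and not tags & SEEDS:
        return False
    return True
```

So a tweet tagged “#thinlizzy #top5bands #galway” passes (a seed is present), while one tagged only “#top5bands” is discarded, as described above.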
Our current approach finds such global hashtags by manually checking the hashtags correlated to trending topics
in Twitter and other Twitter trend websites.6 Further, the
Twitter API can be used to query for hashtags representing currently trending topics. However, we refrained from
adding these without any manual cross-check in order to
avoid the aforementioned sparsity of input data.
One obvious machine learning technique that will likely
help to automate this process is stream-based burst detection. The intuition is that, due to their nature, following global hashtags will result in a burst of corresponding
tweets. There are several applicable burst detection methods available, although we are not aware of any method
particularly designed for Twitter. For future work, we plan
to experiment with the method proposed in [Karnstedt et
al., 2009], which is already available in the aforementioned stream system AnduIN [Klan et al., 2011]. However, for direct application, we will have to slightly modify the currently available implementation. For the time being, we based the greylist on a type of naive burst detection
that makes use of the Twitter Search API. For each hashtag that we identify as potentially relevant, i.e., we decide
to extend it for the next window, we query the API for all
related tweets from the last 24 hours. A hashtag that returns more tweets than a pre-defined threshold is regarded
as a global hashtag. This approach bears the obvious problem of having to set yet another threshold. At the moment,
the threshold is chosen based on historical data and regular
monitoring is required. Further, neither does Twitter guarantee that all corresponding tweets are returned by the API
call, nor are all global hashtags always “bursty”.
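A sketch of this naive burst check, with the 24-hour tweet count supplied by a caller-provided function (the real system queries the Twitter Search API; the threshold value here is an assumed placeholder, chosen in practice from historical data):

```python
# Hypothetical cut-off: hashtags returning more tweets than this in the
# last 24 hours are treated as global hashtags and greylisted.
GLOBAL_THRESHOLD = 10_000

def is_global_hashtag(hashtag, search_count):
    """Naive burst check. `search_count(hashtag)` must return the number
    of tweets containing the hashtag in the last 24 hours."""
    return search_count(hashtag) > GLOBAL_THRESHOLD
```

Passing the count function in keeps the sketch testable without network access, e.g. `is_global_hashtag("#top5bands", counts.get)` against a dict of cached counts.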
Another idea is to use statistical process control (SPC)
methods for calculating the threshold that indicates outliers or special causes – which are then likely candidates
for global hashtags. This would require a fairly stable process to begin with. This and other statistical methods for
burst detection are currently on our agenda.
3 The Analytical Backbone
The analytic goal of the application was to detect news
and emerging topic trends by finding groups of actors and
analysing the topics of these groups. Therefore, the analytic backbone comprises community identification, topical labelling of the communities and, finally, multiple ranking strategies.

6 http://trendsmap.com/, http://whatthetrend.com/

Figure 4: Communities with minimal size of 5 in different types of networks. (a) Only retweet edges; (b) all types of edges.
3.1 Community Analysis
In order to identify groups of actors, it is necessary to establish a notion of similarity or relatedness between the
actors. An intuitive idea is to analyse the follow network
of Twitter. But, we are more interested in the groups that
are existing in-situ. Thus, we need a more dynamic notion of relatedness. For this, different modes of interaction
amongst the Twitter users yield three dynamic social networks, which can be understood as three different modes
of communication happening on Twitter [Glasgow et al.,
2012]:
• RETWEET: quoting other users
• MENTION: talking about other users
• REPLY: talking to other users
In all the networks, the nodes are the users and the directed
weighted edges represent one of the following types of interactions. First, a user i can be connected to j if i has
retweeted j. Analogously, i can be linked to j when i has
mentioned j in her tweets. Finally, the REPLY network consists of users linked by the who-replied-to-whom relation.
The weights represent the number of the interactions we
have observed, e.g., wij in the RETWEET network is the
number of tweets of j retweeted by i.
Apart from the three social networks, Twitter users can also be associated with the hashtags they referred to in their
tweets. This can be conveniently formalised as a bipartite network with n + h nodes, where n is the number of
users, h the number of hashtags. This network contains m
directed links, each connecting a user i to a hashtag j with
weight wij if i has used the hashtag j wij times. Let us call
such a bipartite network REFERS TO. We use the hashtags
to determine topics of the tweets.
In our use case all the networks were generally very
noisy and sparse (n ≈ m), which hindered the community detection step since many of the nodes were isolates.
Further, while many community detection methods have
been developed in recent years [Fortunato, 2010], only few
methods have been proposed to mine communities from
multiplex (multi-relational) networks [Lin et al., 2009;
Mucha et al., 2010]. These methods generally require
a definition of relations between the individual networks
(e.g., how the act of retweeting relates to mentioning, etc.),
which is generally not obvious. This fact, along with the
sparsity and noisiness of the individual social networks, led
us to the decision to initially employ a simple approach: the
three social networks were merged into one by adding up
the weights of the links from the RETWEET, MENTION and
REPLY networks. The resulting network represents a general relation INTERACTS WITH.
We further assume that the INTERACTS WITH relation
correlates with topical interests of the users. The merger,
however, does not fully tackle the problem of noisiness,
despite the fact that it amplifies the signal. Therefore, we
employed the community detection method OSLOM [Lancichinetti et al., 2011], which identifies only statistically significant communities based on a randomised version of the
graph. That way we effectively filtered the noisy relations
and isolates. As a result, only ≈ 7% of nodes in INTER ACTS WITH were found to be members of some significant
community. This number was generally much lower for
communities detected in one of the three elementary social
networks.
We illustrate this in Figure 4. As an example for the
three elementary networks, we show the RETWEET network from a certain point in time in Figure 4a. From the
three discussed communication modes, the RETWEET relation is commonly the most frequently used one in Twitter [Kwak et al., 2010]. The figure clearly shows the many
isolates and very small disconnected components at the periphery. The large hair ball on the right is a community
around Olympics 2012 – which somehow managed to bypass our filters at that time. The second large component,
which consists of a few smaller connected parts, is centred
around the Volvo Ocean Race. In this network, OSLOM
found only five communities. Figure 4b shows the merged
network for the same time. It is about twice as large (4512
vs. 2860 nodes, 5802 vs. 2717 edges), but shows similar characteristics. The Olympics hair ball and the Volvo
Ocean Race cluster are still obvious. The network now reveals connections between both (which might indicate how the Olympics topic ended up in our data), as well as the appearance of a third larger component. Interestingly, the one on the bottom left is centred around Galway users tweeting about the
race, while the new one above it is centred around the accounts of participating race teams – and both are connected
by a small group around the official Twitter account of the
race.
In the merged network OSLOM identified 25 communities. One might think that this is actually a small fraction
given the clustered appearance of the network. However,
often many users refer to a few very central ones (like the
official Olympics account), but rather sparsely among each
other. We believe that this is not an indication of the kind of “news-spreading community” that we are actually looking for. In
fact, we found that most of the 25 communities are as expected and aimed at, such as one around DERI and NUI
Galway spreading the news about the Volvo Ocean Race
app, communities around the teams giving race updates,
etc. However, we consider applying different community
detection algorithms to assess the impact of finding more
(and thus, noisier) communities.
Another feature of OSLOM is that it yields overlapping
community structure, i.e., a node can be a member of multiple communities. This is an important factor, because the
communities are assumed to be centred around shared topics and indeed a user may be interested in multiple distinct topics, which results in her membership in multiple
communities. But, to our surprise, this case occurred very
seldom. Twitter communities seem to be rather disjoint,
although often well connected.
In order to reveal changes in the community structure,
we segmented the streamed INTERACTS WITH network by
a sliding time window 36 hours wide and overlapping by 35
hours, i.e., the window was moved each time by one hour.
We observed that any narrower window gives too little signal. Further, as the majority of relevant Twitter users were
likely to be geographically collocated near the location of
the event, there was a regular gap in their activity during
the night hours. Therefore, the 36-hour window effectively
represents approximately 24 hours of user activity.
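The windowing can be sketched as follows, assuming time-stamped interaction edges; the slicing itself is our illustrative rendering of the 36-hour window with a 1-hour slide:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=36)
SLIDE = timedelta(hours=1)

def window_slices(edges, start, end):
    """edges: list of (timestamp, src, dst, weight) tuples. Yields the
    edge list for each 36 h window, sliding forward by 1 h, so that
    consecutive slices overlap by 35 h."""
    t = start
    while t + WINDOW <= end:
        yield [e for e in edges if t <= e[0] < t + WINDOW]
        t += SLIDE
```

Each yielded slice is then the input network on which community detection runs, and consecutive slices share 35 hours of data, which is what makes tracking communities across slices feasible.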
An independent set of communities Ct = {C1 , . . . , Ck }
was then obtained by running OSLOM on each network
time slice t. In order to maintain an intuitive user experience in the client application, it was necessary to track
the same communities from one time slice to the next. We
adopted the approach proposed by [Greene et al., 2010].
This formulates the process of tracking as a problem of
maximal bipartite matching between the set of clusters
from time slice ti and ti+1 . All communities from both
slices are represented as nodes in the bipartite graph, whose
weighted links represent similarities between each possible
pair of clusters from the two subsequent time slices. We
used the Jaccard coefficient to compute this similarity sC :
sC = |C1 ∩ C2| / |C1 ∪ C2|, with C1 ∈ Cti, C2 ∈ Cti+1
Finally, we did not match clusters whose similarity was below a threshold θc = 0.2.
3.2 Topic Labelling
After the community structure was obtained, it was necessary to label each community by the topic it was centred
around. Recall that a weight wij in the REFERS TO network represents the number of times a user i has used a
hashtag j. The joint frequency fjC of a hashtag j in a community C = {u1, . . . , ul} with l users can therefore be easily obtained as fjC = Σi∈C wij. Note that even though all
the tags of the community members are associated with the
community, we expect the modes of the distribution of the
community’s hashtags to be located at the hashtags representing similar and coherent topics. In order to select only
the relevant themes for the community and filter out noise,
we used a frequency threshold θf and labelled a community C by only topics with fjC ≥ θf . We observed that
choosing θf = 2, i.e., filtering out just the hashtags that
were used only once by a member of a community, already
significantly improved the quality of the labels. This also
confirmed our assumption about topical coherence of the
mined communities.
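The joint-frequency labelling with the θf filter can be sketched as:

```python
from collections import Counter

def community_labels(members, user_hashtag_weights, theta_f=2):
    """Joint frequency f_j^C: sum the per-user hashtag counts w_ij over
    all community members, then keep only hashtags used at least
    theta_f times in total (theta_f = 2 drops tags used once)."""
    joint = Counter()
    for user in members:
        joint.update(user_hashtag_weights.get(user, {}))
    return {tag: freq for tag, freq in joint.items() if freq >= theta_f}
```

Here `user_hashtag_weights` is the REFERS TO network seen from the user side: a mapping from each user to their hashtag usage counts.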
While hashtag usage is a natural and intuitive approach for labelling communities, many users do not use hashtags in their tweets. Hence, we believe the labelling process can be improved by more advanced text-processing
techniques like keyword/key-phrase extraction, named entity recognition, etc. We further discuss this in Section 4.
3.3 Ranking is Vital
As mentioned in Section 2.2, there is an unavoidable tradeoff between having too much noise in the system vs. having
not enough interesting and relevant information. As such,
we cannot always avoid finding communities and topics that are not perfectly (or rather, not obviously) related to the event.
The only way to overcome this and to still provide a satisfying user experience is an appropriate ranking. With a good
ranking, the crucial and relevant information will always be
the first the user sees – and rather unrelated information is
shown at the bottom of the groups list.
Thus, after the communities were identified and labelled,
multiple ranking strategies were used. First, as we have
already described, the importance of a topic (hashtag) j
within a community C is measured by its joint frequency
fjC . Second, the communities were ranked according to
their correspondence with the global focus of the application. Third, the members of each community were ranked
in order to present the most relevant tweets in each community.
The global focus of the application is determined by the
set of p seed hashtags S = {h1 , . . . , hp }, which are used
to obtain the tweets relevant to the event (see Section 2.2).
Our first approach was to measure that similarity by the Jaccard coefficient between the seed set and the set of community hashtags. This works well, but can have a drawback in
certain situations. If a community has many hashtags assigned, which cover a significant fraction of the seed hashtags, it can still be ranked lower than a community with
only a few hashtags that cover only a small part of the seed
hashtags. Thus, in our current version, we remove that dependence on the number of hashtags in a community and
use only the number of overlapping hashtags:
rC = |S ∩ H|, H = {j : fjc ≥ θf }
We are still analysing which of the two rankings is better suited to which situations; in fact, both significantly improve the results forwarded to the mobile client. If two communities have the same rank r_C1 = r_C2, the one with more hashtags is ranked higher. The intuition is that a community with more hashtags is more specific and should thus be listed first. If the number of hashtags is also equal, the communities are finally ranked by their size, i.e., bigger communities are given priority.
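The full group-ranking procedure (seed overlap first, then number of hashtags, then community size) can be sketched as follows; the `Community` record and the threshold default are our own hypothetical assumptions, not the system's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Community:
    """Hypothetical record for a mined community."""
    members: set                                      # user ids in the community
    hashtag_freq: dict = field(default_factory=dict)  # hashtag j -> joint frequency f_j^C

def rank_communities(communities, seeds, theta_f=1):
    """Order communities by (1) seed overlap r_C = |S ∩ H|,
    (2) number of hashtags in H (more hashtags = more specific),
    (3) community size, mirroring the tie-breaking described in the text."""
    def key(c):
        H = {j for j, f in c.hashtag_freq.items() if f >= theta_f}
        return (len(seeds & H), len(H), len(c.members))
    return sorted(communities, key=key, reverse=True)

# Toy example: c1 overlaps two seed hashtags, c2 only one.
seeds = {"#vor", "#galway", "#sailing"}
c1 = Community({"a", "b", "c"}, {"#vor": 5, "#galway": 3, "#music": 2})
c2 = Community({"d", "e"}, {"#sailing": 4})
ranked = rank_communities([c1, c2], seeds)  # c1 is ranked first
```

The composite sort key makes the tie-breaking explicit: ties in r_C fall through to hashtag count, then to size.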
In order to rank the users within a community, we first
induced a subgraph of the INTERACTS WITH network by
including only the members of the community and the links between them. The nodes were then ranked by PageRank. We chose PageRank in particular because it has frequently been used as a heuristic for capturing the influence and authority of actors in social networks [Sun and Tang, 2011]. Finally, tweets in the mobile application are ordered by decreasing PageRank value of their authors. For future versions, we plan to give the user the option to switch to a purely time-based ordering, as provided by standard Twitter applications. We further consider experimenting with other social ranking mechanisms, such as taking the actual user, their preferences and their personal networks into account [Gou et al., 2010; Haynes and Perisic, 2009]. One promising approach is to apply a tailored notion of PageRank such as the FolkRank proposed in [Hotho et al., 2006].
Our hypothesis is that the most central users in the observed networks are the most important ones, and the ones that most frequently initiate the most important news. As such, we also decided to follow these users until the next community analysis is performed (currently at 1-hour intervals). The top 33% of users or the top-3 users (whichever is the smaller set) are therefore signalled to the filtering component, which takes care of an appropriate extension of the filters (see Section 2.2). Further, the mobile application shows the last received tweet from the two highest-ranked users of each community on its main screen (see Figure 1). It is thus important that these tweets are indeed the most recent ones from these users.
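The user-ranking step can be illustrated with a small pure-Python sketch. We implement a toy power-iteration PageRank rather than depending on a graph library; the function names and the edge-list representation are our own assumptions, not the system's code:

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    """Toy power-iteration PageRank over a directed edge list."""
    nodes = list(nodes)
    out = {u: [v for a, v in edges if a == u] for u in nodes}
    pr = dict.fromkeys(nodes, 1.0 / len(nodes))
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, (1 - d) / len(nodes))
        for u in nodes:
            targets = out[u] or nodes  # dangling nodes spread mass uniformly
            for v in targets:
                nxt[v] += d * pr[u] / len(targets)
        pr = nxt
    return pr

def rank_community_users(edges, members, top_frac=0.33, top_k=3):
    """Rank a community's members on the induced INTERACTS WITH subgraph and
    pick the users to keep following (the smaller of top 33% / top-3)."""
    sub = [(u, v) for u, v in edges if u in members and v in members]
    pr = pagerank(sub, members)
    ranked = sorted(members, key=lambda u: pr[u], reverse=True)
    n_follow = min(max(1, int(len(ranked) * top_frac)), top_k)
    return ranked, ranked[:n_follow]

# Toy star network: everyone interacts with "alice".
edges = [("bob", "alice"), ("carol", "alice"), ("dave", "alice")]
ranked, follow = rank_community_users(edges, {"alice", "bob", "carol", "dave"})
# "alice" receives all links, so she tops the ranking and is followed.
```

In practice one would of course use an optimised PageRank implementation; the point here is only the pipeline of inducing the subgraph, ranking by centrality, and selecting the users to feed back to the filtering component.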
[Figure: tweet count over days/hours of day, annotated with the points where seed terms were added/removed and where the topic appeared in/left the app]
Figure 5: “Topic latency”: timely detection, late removal

[Figure: tweet count over days/hours of day for a low-volume topic, with the same annotations as in Figure 5]
Figure 6: “Topic latency”: late detection, timely removal

4 Lessons Learned and Future Work
Our general conclusion is positive: we can say “It works!”. We continuously monitored the application and the determined groups and topics before and during the Volvo Ocean Race in Galway. Indeed, the application displayed mostly relevant groups with interesting and relevant topics. These ranged from small to large topics and clearly separated certain groups of interest, such as those around the race teams or particular events – which, as a byproduct, also separated groups by language. Still, as discussed before, it is unavoidable that some topics enter the application that are not obviously related to the event. To our relief, the applied ranking method always positioned them at the lower end of the groups list. Still, we plan to run experiments with different, less restrictive community detection methods to assess the quality of the currently identified groups – which are often rather small.
There are many important lessons we have learned and a wide range of identified challenges and research topics for future work. They arose both from the sheer work of developing a real-time analytics application with a user-friendly interface and from issues we observed regarding the applied concepts and general design choices. One of the most crucial challenges lies in the actual analytics. Our batch-based approach works only because, due to the filtering, we handle a rather limited part of the whole Twitter stream. This might not always be the case, nor always appropriate. Moreover, any method based on a-priori-defined fixed windows is usually prone to producing sub-optimal results. As such, we plan to investigate methods for real-time community detection and analysis in depth. Our main focus will be on community monitoring that supports the identification of change points, which signal the need for re-running the network clustering. Later, we will assess opportunities for doing the whole community detection itself in a streaming fashion.
The general idea of moving from a topic-centred approach to a group-centred one works out. In first tests based on pure topic clustering in the analysed time slices, we found that, as expected, large topics “swallow” smaller ones – in our approach, the listed groups and topics range from small to large. Nevertheless, we experienced a sort of “topic latency” problem, as shown in Figures 5 and 6 for two events during the race. Figure 5 shows the number of tweets over time for a very popular band and indicates the time span during which we observed a corresponding topic in our app. As one can see, a large number of tweets results in an accurate early detection – but also in a belated removal of the topic. In contrast, an event with a rather low number of tweets (Figure 6) can result in a belated topic display but a timely removal. We plan to tackle this issue as follows:
1. Decouple the topic detection from the actual group detection. While we have to use a larger window for finding a community signal in the networks, this is not necessary for the topic labelling of a community. We plan to mine these topic labels at shorter intervals for the found communities, which will also reflect topic changes in groups more accurately.
2. Integrate a more sophisticated method for trend and event detection. This should be combined with a method for detecting bursts, which we already plan to incorporate into the filtering component (see Section 2.2). In order to do this in (near) real time, we consider adopting more stream-mining approaches and will investigate cloud-based computing frameworks [Vakali et al., 2012].
3. Include NLP techniques to extend the topic labelling from hashtags only to general keywords. First tests in this direction were promising. This will, however, require a higher weighting of hashtags, as they are generally understood as topic labels by the Twitter community.
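Re-labelling the fixed communities at shorter intervals than the community detection itself could look roughly like the following sketch; the tweet and author-to-community representations are hypothetical:

```python
from collections import Counter

def relabel_communities(window_tweets, community_of, top_n=3):
    """Recompute hashtag labels per community from a short window of tweets.

    window_tweets: iterable of (author, hashtag_set) pairs from the short window.
    community_of:  author -> community id, from the last (longer-window) detection.
    Returns {community_id: top hashtags}, so labels refresh faster than groups."""
    freq = {}
    for author, tags in window_tweets:
        cid = community_of.get(author)
        if cid is not None:
            freq.setdefault(cid, Counter()).update(tags)
    return {cid: [t for t, _ in cnt.most_common(top_n)]
            for cid, cnt in freq.items()}

# Toy window over two previously detected communities.
window = [("a", {"#vor"}), ("b", {"#vor", "#galway"}), ("c", {"#music"})]
labels = relabel_communities(window, {"a": 1, "b": 1, "c": 2})
# labels[1] starts with "#vor"; labels[2] == ["#music"]
```

The community memberships stay fixed between detection runs while the per-community hashtag counts are recomputed from each short window, which is exactly the decoupling the first item proposes.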
During the development of the app, we were surprised to see how crucial the quality of the filtering component actually is. As above, we identified a strong need for advanced burst and trend detection methods, as well as other suitable knowledge discovery and machine learning techniques. To make the application run really smoothly, we have to achieve almost 100% automated maintenance of the filters. In fact, feeding the filters is currently the one component that requires the most manual input and benefits the least from automated analyses. Nevertheless, this manual and laborious process helped us to clearly identify the main challenges and problems we will have to face when aiming for an automated approach:
• Our results rely heavily on the behaviour of Twitter users, e.g., retweeting, usage of Twitter lists, accurate user descriptions and the usage/non-usage of hashtags. This makes it very challenging to remove the manual element from finding suitable users and terms. Which data mining techniques are suited to help here? Can they be applied out of the box, or do they have to be adapted to the particular use case?
• The decision-making for including/excluding users
and terms is very subjective. For instance, should we
include all users on a relevant Twitter list or should
we filter them manually? Should we exclude users
who are mostly advertising products? Maybe some of
these products are interesting to the users of the app.
Should we put all global hashtags on the greylist, all
topics trending on Twitter? This might be problematic
when considering larger events, such as the Olympics,
which bear a global character in themselves.
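As a concrete starting point for the burst detection called for above, even a simple moving-window z-score test over per-term counts could flag candidate seed terms automatically. This is a hypothetical sketch of such a heuristic, not the system's actual filtering component:

```python
from collections import deque
from statistics import mean, stdev

class BurstDetector:
    """Flag a term as bursting when its current count exceeds the recent
    mean by z standard deviations (moving-window z-score heuristic)."""
    def __init__(self, window=24, z=3.0):
        self.history = deque(maxlen=window)  # e.g. last 24 hourly counts
        self.z = z

    def update(self, count):
        burst = False
        if len(self.history) >= 2:
            m, s = mean(self.history), stdev(self.history)
            burst = count > m + self.z * max(s, 1e-9)
        self.history.append(count)
        return burst

# Hourly counts for one term: stable traffic, then a sudden spike.
det = BurstDetector(window=6)
signals = [det.update(c) for c in [10, 12, 11, 9, 10, 80]]
# Only the final spike is flagged as a burst.
```

A detector like this still leaves the subjective inclusion decisions open, but it would at least surface candidates for them automatically.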
The same holds for the importance of post-processing, i.e., what to show in the mobile client and how. Should we display retweeted messages just because they are the most recent, or rather only the original tweets? What minimal group size actually makes sense? Should tweets appear in multiple groups if their author belongs to more than one? These questions are also related to the crucial aspect of appropriate ranking techniques. This concerns, first and foremost, the ranking of groups and thus topics, but it also refers to the ranking of users, tweets and the keywords used to label topics.
Along the same lines, we believe the application provides perfect ground for research on personalisation. We envision including personal preferences, personal social networks, the user's own tweets and topics, etc. This will benefit strongly from the aforementioned social ranking techniques, but also from research in the area of user interfaces and HCI.
Last but not least, the probably expected statement about the many unexpected issues: never underestimate Murphy's law! Although the app had been running smoothly for days before the actual event, at the time of greatest importance we ran into issues with temporary outages of our Twitter connection, hardware failures, etc. This can partly be accommodated by unit tests, much more intensive than the norm for any research prototype. Similarly, one tends to underestimate the effort required to build such a complete architecture and the wide range of issues, from tiny to crucial, that one has to face during design and development. As researchers, we are not primarily used to creating working applications within a given time frame – for us, this was a very good lesson about all the glitches and itches involved.
Acknowledgements
This work was jointly supported by the European Union
(EU) under grant no. 257859 (ROBUST integrating
project) and Science Foundation Ireland (SFI) under grant
no. SFI/08/CE/I1380 (LION-2).
References
[Cha et al., 2010] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. Measuring
User Influence in Twitter: The Million Follower Fallacy.
In ICWSM, 2010.
[Fortunato, 2010] Santo Fortunato. Community detection
in graphs. Physics Reports, 486:75–174, 2010.
[Glasgow et al., 2012] Kimberly Glasgow, Alison Ebaugh, and Clayton Fink. #londonsburning: Integrating geographic, topical and social information during crisis. In ICWSM, 2012.
[Gou et al., 2010] Liang Gou, Xiaolong (Luke) Zhang,
Hung-Hsuan Chen, Jung-Hyun Kim, and C. Lee Giles.
Social network document ranking. In JCDL, pages 313–
322, 2010.
[Greene et al., 2010] D. Greene, D. Doyle, and P. Cunningham. Tracking the evolution of communities in dynamic social networks. In ASONAM, pages 176–183,
2010.
[Haynes and Perisic, 2009] Jonathan Haynes and Igor
Perisic. Mapping search relevance to social networks.
In Workshop on Social Network Mining and Analysis
(SNA-KDD), pages 2:1–2:7, 2009.
[Hotho et al., 2006] Andreas Hotho, Robert Jäschke,
Christoph Schmitz, and Gerd Stumme. Information
retrieval in folksonomies: search and ranking. In
ESWC, pages 411–426, 2006.
[Karnstedt et al., 2009] M. Karnstedt, D. Klan, Chr.
Politz, K.-U. Sattler, and C. Franke. Adaptive burst detection in a stream engine. In SAC, pages 1511–1515,
2009.
[Klan et al., 2011] D. Klan, M. Karnstedt, K. Hose,
L. Ribe-Baumann, and K. Sattler. Stream Engines Meet
Wireless Sensor Networks: Cost-Based Planning and
Processing of Complex Queries in AnduIN. Distributed
and Parallel Databases, 29(1):151–183, January 2011.
[Kwak et al., 2010] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010.
[Lancichinetti et al., 2011] A. Lancichinetti, F. Radicchi, J.J. Ramasco, and S. Fortunato. Finding statistically significant communities in networks. PLoS ONE, 6(4):e18961, 2011.
[Lin et al., 2009] Y.R. Lin, J. Sun, P. Castro, R. Konuru,
H. Sundaram, and A. Kelliher. Metafac: community
discovery via relational hypergraph factorization. In
SIGKDD, pages 527–536, 2009.
[Mucha et al., 2010] P.J. Mucha, T. Richardson, K. Macon, M.A. Porter, and J.P. Onnela. Community structure
in time-dependent, multiscale, and multiplex networks.
Science, 328(5980):876, 2010.
[Sun and Tang, 2011] J. Sun and J. Tang. Social Network
Data Analytics, chapter A survey of models and algorithms for social influence analysis, pages 177–214.
Springer, 2011.
[Vakali et al., 2012] Athena Vakali, Maria Giatsoglou, and
Stefanos Antaris. Social networking trends and dynamics detection via a cloud-based framework design. In
Workshop on Mining Social Network Dynamics (MSND
–WWW Companion), pages 1213–1220, 2012.