Visualizing Trends on Twitter
by
Cheng Hau Tong
B.S., Massachusetts Institute of Technology (2008)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2013
© Massachusetts Institute of Technology 2013. All rights reserved.
Author: Department of Electrical Engineering and Computer Science, August 26, 2013

Certified by: Samuel Madden, Professor, Thesis Supervisor

Accepted by: Albert R. Meyer, Chairman, Masters of Engineering Thesis Committee
Visualizing Trends on Twitter
by
Cheng Hau Tong
Submitted to the Department of Electrical Engineering and Computer Science
on August 26, 2013, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
With its growing popularity, Twitter has become an increasingly valuable source of real-time,
user-generated information about interesting events in our world. This thesis presents
TwitGeo, a system to explore and visualize trending topics on Twitter. It features an
interactive map that summarizes trends across different geographical regions. Powered by a novel GPU-based datastore, this system performs ad hoc trend detection
without predefined temporal or geospatial indexes, and is capable of discovering trends
with arbitrary granularity in both dimensions. An evaluation of the system shows
promising results for visualizing trends on Twitter in real time.
Thesis Supervisor: Samuel Madden
Title: Professor
Acknowledgments
I would like to thank Professor Samuel Madden for his invaluable ideas, support and
guidance throughout the course of this research. He has been a great mentor and I
am greatly indebted to him.
I would also like to acknowledge Todd Mostak, the creator of MapD, for the opportunity to work with him on this project. I thank him for his inspiration and for showing
me the wonder of GPU computing. I also thank Adam Marcus for his incredible help,
advice and the brainstorming sessions we had.
Finally, I would like to thank my family and friends for their support over the
years.
Contents
1 Introduction

2 Background
  2.1 MapD
  2.2 TweetMap
  2.3 Related Work
    2.3.1 Database Operations on GPU
    2.3.2 Trend Detection on Twitter

3 Trend Detection
  3.1 Histogram
    3.1.1 Preprocessing
    3.1.2 COUNT with GROUP BY
  3.2 Trend Computation
  3.3 Partial Sorting

4 User Interface
  4.1 Search Engine
  4.2 Map
  4.3 Tweets Display
  4.4 Timeline

5 Evaluation
  5.1 Experimental Setup
  5.2 Results
    5.2.1 System Performance
    5.2.2 Histogram Construction
    5.2.3 Partial Sorting

6 Conclusion
  6.1 Contributions
  6.2 Future Work

A Trend Detection Statistics

B Query Plans
List of Figures

3-1 Trend detection inputs.
4-1 Visualization interface design.
4-2 Visualization interface screenshot.
4-3 Map layers.
4-4 Geo Trends layer.
4-5 Timeline display.
5-1 Response time of trend query.
5-2 Histogram construction time.
5-3 Performance of partial sorting.
Chapter 1
Introduction
Twitter is a social media platform that allows users to share topics and conversations
in a short text message limited to 140 characters. Since its launch in 2006, Twitter
has become a valuable source of real-time, user-generated content about interesting
world and local events. In fact, mainstream media today is increasingly using Twitter
to source stories for major news events.
When certain topics become popular on Twitter, users are often interested in the
context about these events. In particular, they want to know where and when these
events took place, and how the stories progress over time. Even more so, users are
curious about the opinions of the public or some popular figures towards these stories,
and how the responses might differ in their local neighborhood versus other locations.
However, identifying emerging topics from a large collection of documents and
turning this data into meaningful insights is a challenging problem. A prerequisite for
detecting events in a temporal stream is to first obtain a representation of its activity
over a sequence of evenly spaced time intervals, otherwise known as a time series. The
procedure of grouping data points from successive time intervals, usually accomplished by a
COUNT with GROUP BY query on a database, can be time-consuming and does
not scale well as the data set grows larger.
In order to work around this limitation, some systems resort to precomputing the
time series instead of calculating them on the fly. However, this approach is inflexible
and imposes additional constraints on analytics that can be derived from the time
series. For example, suppose we precompute some time series representing the hourly
count of tweets originated from the United States. From this data, we cannot tell
what percentage of these tweets comes from the Boston area, or contains the word
“Obama”, or is written by user John Doe.
Our goal is to visualize trending topics on Twitter, which requires sub-second
latency for best user experience. However, this is not the performance most relational
systems are engineered to provide, especially over large data sets. In this thesis, we
explore a different approach to tackle the problem of trend detection on a large data
set. Instead of relying on precomputed statistics, we take advantage of a graphics
processing unit (GPU) to speed up data retrieval and computations. In particular,
its high memory bandwidth reduces latency waiting for new data and its parallel
processing power allows for computations on more data at the same time. As we
perform all calculations as needed, this approach also enables us to detect trends
on an arbitrary subset of the original data, and not be limited by any predefined
parameters.
In this thesis, we present TwitGeo, a system for visualizing trends on Twitter.
Our contribution is a parallel trend detection algorithm that uses MapD [12], a fast,
GPU-based datastore, as the back end. Specifically, we decompose the problem into
three phases (grouping, statistics computation and ranking) and present a parallel
implementation for each phase. We analyze the performance of our implementation
against a traditional database, and discuss cases where one outperforms the other. As
an application of our parallel back end, we describe a visualization interface we have
built that displays trending keywords on Twitter. We demonstrate how our approach
allows for trend detection on tweets with arbitrary temporal and spatial granularity
in the absence of any indexes on these columns.
We begin in Chapter 2 with previous work on database operations on GPU and
trend detection on Twitter. In this chapter, we also provide an overview of MapD.
In Chapter 3, we describe our parallel implementations of the three phases in trend
detection. Chapter 4 presents our visualization interface that allows users to see
trend patterns in both space and time. Chapter 5 describes how we evaluate the
performance of our system and discusses the results. Chapter 6 concludes this thesis
and outlines directions for future work.
Chapter 2
Background
In this chapter, we begin with a brief overview of our novel database back end in
Section 2.1 and its existing visualization interface for Twitter in Section 2.2. We
discuss related work in database implementations on GPU and trend detection on
Twitter in Section 2.3.
2.1 MapD
The Massively Parallel Database (MapD) is a novel parallel datastore created by Todd
Mostak, a fellow researcher at MIT. Built on top of the NVIDIA CUDA platform, the
system uses the GPU's massively parallel processing power and high memory bandwidth
to speed up common database operations, resulting in as much as a 70× speedup in
comparison to CPU-based systems [12].
The architecture of MapD is designed to optimize query performance by using
parallelism efficiently. While MapD stores its data on disk in memory-mapped files,
it caches as much data as possible in the GPU memory. This eliminates unnecessary
data transfer from the system to the GPU because such transfers provide lower bandwidth than memory accesses within the GPU. In most queries, a database performs
the same computation on items from the same column, and these operations can be
carried out independently. By storing its data in columnar format, MapD maximizes
data parallelism and allows the GPU to compute on many items at the same time. In
addition, it improves memory performance due to the coalesced pattern of its memory
access.
MapD has great query performance, often taking just milliseconds to access millions of rows, especially with aggregation queries. This performance makes it a useful
tool for computing analytics that might otherwise take hours on a CPU. To speed up
operations on text-based columns, it splits text into words and translates them into
unique integer identifiers known as word ID’s. The additional benefit of doing so is
that this representation has a smaller memory footprint and allows for more data to
be stored in the GPU.
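To illustrate the idea of dictionary-encoding words into integer ID's (a sketch of the general technique, not MapD's actual implementation), a host-side encoder might look like this:

#include <string>
#include <vector>
#include <unordered_map>

struct Dictionary
{
    std::unordered_map<std::string, int> ids;   // word -> word ID
    std::vector<std::string> words;             // word ID -> word

    // Returns the existing ID for a word, or assigns the next free one.
    int encode(const std::string& w)
    {
        std::unordered_map<std::string, int>::iterator it = ids.find(w);
        if (it != ids.end())
            return it->second;
        int id = (int)words.size();
        ids[w] = id;
        words.push_back(w);
        return id;
    }
};

Each distinct word is stored once, and the columns themselves only hold compact integer ID's, which is what allows more data to fit in GPU memory.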
The system features a SQL interface, which allows other researchers to access the
database with little learning curve. To facilitate data migration, MapD supports data
import from other databases including MySQL and PostgreSQL. In addition, MapD
also provides some functionality specific to geospatial applications, such as heat map
and geolocation map rendering.
However, as a new database still in development, MapD has its own limitations.
The system cannot handle data that is not cached in the GPU memory, and requires
explicit instruction from users to load the data. Because of this, its working data
size is limited by the amount of memory available on the GPU’s, which makes it less
useful for big data applications. Presently, the largest memory available on a single
high-end GPU is 12 GB [13].
This thesis extends the functionality of MapD by providing a parallel implementation of grouping on the GPU.
2.2 TweetMap
Another visualization powered by MapD, TweetMap [11] is the predecessor of our
interface. Both interfaces center on geospatial data and aim to highlight spatial patterns
observed in tweets. While TweetMap is a great tool to visualize the spatial distribution of tweets, it provides limited context and insight into the content of the
tweets. Our interface addresses these needs by showing users the trending keywords
over different geographical regions on the map. In addition, we introduce new elements
that make our visualization more interactive and engaging, as detailed in Chapter 4.
2.3 Related Work

2.3.1 Database Operations on GPU
Here we give an overview of previous work on database implementations on GPUs. Prior
to the availability of General-Purpose Graphics Processing Unit (GPGPU) programming languages, researchers exploited direct access to texture memory on the GPU to
perform vector computations. Govindaraju et al. presented algorithms for common
database operations including predicates, boolean combinations, and aggregations [5],
and later on sorting [4], all implemented with the OpenGL API. Even at this early stage,
some database operations on the GPU were already seeing performance gains over CPU
implementations.
The introduction of the NVIDIA CUDA platform in 2007 gave developers a programming interface to harness the massively parallel processing power of the GPU for
database operations. He et al. presented an in-memory relational database, GDB,
with support for selection, projection, ordering, joins, and grouping and aggregation [6]. Bakkum and Skadron observed a performance gain upwards of 20× on their
GPU implementation of a SQLite database against its CPU counterpart, but omitted
support for queries with grouping [1].
The most closely related work is the study by Zhang et al. to aggregate spatial and
temporal data from taxi trips in New York City on GPU [16]. The authors presented
a parallel grouping algorithm implemented using the sort and reduce_by_key data-parallel
primitives from the Thrust library. In order to count taxi pickup records by
street and hour, they combined a street identifier vector with a pickup hour vector
to form a key vector, sorted the key vector, and counted the occurrence of each
unique key in the vector. We take a similar approach in Section 3.1 to construct our
histogram for trend detection.
While CUDA is becoming stable and mature, recent work in the field shows
increasing interest in device-agnostic parallel programming platforms. Heimel et al.
presented Ocelot, an extension of the open-source columnar database MonetDB [7].
Unlike MapD, Ocelot is implemented in OpenCL and can take advantage of parallelism in both multi-core CPUs and GPUs. The authors implemented common database
operations mostly based on existing work in the area, and also introduced a hash
table implementation for their grouping operator.
2.3.2 Trend Detection on Twitter
There are numerous tools that help users explore interesting topics on Twitter, but
a majority of them do not offer users the flexibility to define the scope of data on
the fly. On one hand, some work on prefiltered data sets: Eddi [3] uses a user's feed;
TwitInfo [9] requires users to predefine a set of keywords. On the other hand, GTAC
[8] detects geo-temporal events from around the world. Our approach is to combine
the best of both worlds by allowing the users to be as broad or specific as they wish.
This requires a fast trend detection algorithm capable of handling a large collection
of unprocessed tweets.
While researchers use different definitions of trends on Twitter, we can generalize
their models into keyword-based versus topic-based. Benhardus [2] used a keyword-based approach to identify trends from n-grams, whereas Mathioudakis and Koudas
[10] defined topics as groups of bursty keywords with strong co-occurrences. Ramage
et al. [14], Wang et al. [15], and Lau et al. all used variants of Latent Dirichlet
allocation (LDA), a common topic model, to discover the underlying topics in tweets.
While a topic model might offer better contextual insights, we choose the keyword-based approach for its simplicity and apparent data parallelism.
Chapter 3
Trend Detection
In this chapter, we present a parallel implementation of trend detection on Twitter.
Before diving into the details, we first define the problem. The input given is a list
of tweets within a given time range and a given map viewport. Each tweet contains
a timestamp, a coordinate and some text. We then divide the viewport evenly into
M × N square grid cells. For each grid cell, we return the top K words that are most trending
among all tweets contained within that cell. We broadly define a word as trending
when its mean normalized word frequency is significantly higher in the more recent
windows of time compared to earlier windows.
At a high level, we can decompose this problem into several pieces. First, for each
unique word that occurs in the input tweets, we need to understand its spatial and
temporal distribution, that is we need to obtain a 3-D histogram—two dimensions
for space and one for time. On a database, this can be accomplished by the SQL query
below, where binX, binY and binTime denote identifiers of the M bins in x-coordinates,
N bins in y-coordinates and L bins in time, respectively.
SELECT binX, binY, binTime, word, COUNT(*)
FROM tweets
GROUP BY binX, binY, binTime, word;
Given this histogram, we then compute statistics that will allow us to detect trends
observed in the time-series histogram for each word in each space bin. Finally, we
need to sort words in each space bin by their previously computed trending statistics.

In Sections 3.1 through 3.3, we explore how these subproblems can be implemented to take
advantage of a GPU’s massively parallel processing power as well as its high memory
bandwidth.
3.1 Histogram
Let us further examine the inputs of our trend detection algorithm. Let T denote the
number of input tweets, and U be the total number of words in those tweets. The
actual inputs are arrays representing fields in tweets, as shown in Figure 3-1. The
timestamp, x-coordinate and y-coordinate arrays each has cardinality T. The MapD
server represents the text in a tweet as a variable-length array of 32-bit integers known
as word identifiers (word ID’s), each of which uniquely identifies a word. Therefore,
the server returns text in tweets in the form of a word array of size U containing the
word ID’s. To retrieve the word ID’s, we are provided with an offset array of size T
which contains indexes into the word array indicating the first word ID of each of the
T tweets.
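For illustration, the following host-side sketch walks the word ID's of a single tweet t in this layout; it assumes, as the preprocessing step below also does, that the words of tweet t end where those of tweet t + 1 begin, with the total word count U closing the final tweet.

#include <vector>

// Collect the word ID's of tweet t under the layout of Figure 3-1.
std::vector<int> tweet_words(const int* word, const int* offset, int T, int U, int t)
{
    int begin = offset[t];
    int end = (t + 1 < T) ? offset[t + 1] : U;   // assumption: U closes the last tweet
    return std::vector<int>(word + begin, word + end);
}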
Figure 3-1: Trend detection inputs consist of five arrays: timestamp, x-coordinate,
y-coordinate, offset and word. The array offset contains indexes into word indicating
the first word ID of each of the tweets.
In the sections that follow, we describe how we preprocess the input arrays and
then construct a histogram from them.
3.1.1 Preprocessing
Before we calculate the histogram, we filter out stop words such as “I”, “am”, “going”,
etc., which do not add much context to a sentence and should not be considered
trending words. This step improves the performance of histogram construction by
reducing the data size significantly. We also annotate each word occurrence (i.e. each
element in array word ) with its x, y and time-bin identifiers (bin ID’s), based on the
number of space and time bins, the bounding box of the viewport, and the time range
of the tweets. In general, given a number of bins M and a value range [X_0, X_1], a bin
ID (zero-based) is calculated as

    binX = \left\lfloor \frac{x - X_0}{X_1 - X_0} \times M \right\rfloor
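As an illustration, a device-side helper for this calculation might look as follows; the function name is ours, and the final clamp is an added assumption so that a value equal to X_1 falls into the last bin.

__device__ int compute_bin(float v, float lo, float hi, int numBins)
{
    // Zero-based bin index per the formula above.
    int bin = (int)floorf((v - lo) / (hi - lo) * numBins);
    // Clamp so that v == hi lands in the last bin instead of bin numBins
    // (an assumption about boundary handling, not taken from the thesis).
    if (bin < 0) bin = 0;
    if (bin > numBins - 1) bin = numBins - 1;
    return bin;
}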
Algorithm 1 details how we perform these preprocessing steps in parallel. We first
load the word ID’s of the stop words to a bitarray, stopWords. We also allocate an
array binEntry of size U to store the bin entries. We define a bin entry as a 64-bit
integer that describes an occurrence of a word by its x, y and time-bin ID’s, and its
word ID. Its value is given by
binEntry = (binY << 56) | (binX << 48) | (wordId << 16) | binTime
that is 8 bits allocated for y-bin ID, 8 bits for x-bin ID, 32 bits for word ID, and 16
bits for time-bin ID. We will see in Section 3.2 what roles this memory layout plays
in the implementation of our trend detection function. Continuing with Algorithm
1, in parallel, each GPU thread preprocesses a single tweet. For each word in the
underlying tweet that is not a stop word, the thread writes the value given above to
binEntry, or 0 otherwise.
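For illustration, the 64-bit packing used in Algorithm 1 below can be written as the following device helpers; the names are ours, not the thesis code.

typedef unsigned long long BinEntry;

__device__ BinEntry pack_bin_entry(unsigned int binY, unsigned int binX,
                                   unsigned int wordId, unsigned int binTime)
{
    // binY occupies bits 56-63, binX bits 48-55, wordId bits 16-47,
    // and binTime the low 16 bits.
    return ((BinEntry)binY   << 56) |
           ((BinEntry)binX   << 48) |
           ((BinEntry)wordId << 16) |
           (BinEntry)(binTime & 0xFFFF);
}

__device__ unsigned int entry_word_id(BinEntry e)
{
    // Shifting out the 16-bit time field and truncating to 32 bits
    // recovers the word ID.
    return (unsigned int)(e >> 16);
}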
Algorithm 1 Creating bin entries and filtering stop words
Require: Array binEntry of size U
 1: function CreateBinEntries(x, y, time, word, threadId)
 2:   calculate binX from x, binY from y, and binTime from time
 3:   base ← (binY << 56) | (binX << 48) | binTime
 4:   location ← offset[threadId]
 5:   end ← offset[threadId + 1]
 6:   while location ≠ end do
 7:     wordId ← word[location]
 8:     if wordId ∉ stopWords then
 9:       binEntry[location] ← base | (wordId << 16)
10:     else
11:       binEntry[location] ← 0
12:     end if
13:     location ← location + 1
14:   end while
15: end function

After binEntry is populated, we move all non-null bin entries to the head of the
array, and the rest to the tail, by calling the parallel thrust::remove_if function from
the Thrust library, a parallel algorithms library that resembles the C++ Standard
Template Library (STL). This function returns an iterator pointing to the new end
of the retained range, that is, one past the last non-zero bin entry.
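As a host-side illustration of this compaction step, assuming binEntry lives in a thrust::device_vector of 64-bit integers populated by Algorithm 1, the call might look as follows:

#include <thrust/device_vector.h>
#include <thrust/remove.h>

struct is_null_entry
{
    __host__ __device__ bool operator()(unsigned long long e) const
    {
        return e == 0ULL;   // 0 marks a stop word written by Algorithm 1
    }
};

// Moves all non-null bin entries to the front and returns the new logical end.
thrust::device_vector<unsigned long long>::iterator
compact_bin_entries(thrust::device_vector<unsigned long long>& binEntry)
{
    return thrust::remove_if(binEntry.begin(), binEntry.end(), is_null_entry());
}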
3.1.2 COUNT with GROUP BY
Now that each occurrence of a word is annotated with its bin ID's in the form of a
bin entry, we can implement a COUNT with GROUP BY function in a few steps,
outlined in Algorithm 2, which takes advantage of the rich collection of data-parallel
primitives provided by the Thrust library.

Each occurrence of the same word in the same bin has the same binEntry value.
To group these occurrences together, we sort the binEntry array in parallel by calling
thrust::sort. We write each unique bin entry and the number of times it occurs into
the arrays uniqueBinEntry and wordCounts, respectively. However, before we can
allocate these arrays, we need to find out their size by counting the unique values in
binEntry. This can be accomplished by calling the thrust::adjacent_difference and
thrust::count parallel functions. We also allocate an auxiliary array ones of size U
and fill it with 1's. Finally, we count the number of occurrences for each unique bin
entry value. We do so by calling thrust::reduce_by_key, which sums consecutive
ones values of the same binEntry value in parallel, and then writes each unique bin
entry and its sum of ones to uniqueBinEntry and wordCounts, respectively.
Algorithm 2 Counting word occurrences by bins and words
Require: Array binEntry
 1: function CountByGroups
 2:   sort binEntry
 3:   V ← number of unique values in binEntry
 4:   allocate array ones of size U and fill with 1's
 5:   allocate arrays uniqueBinEntry and wordCounts of size V
 6:   wordCounts ← sum of consecutive ones elements sharing the same binEntry value,
      with the corresponding unique values written to uniqueBinEntry
 7: end function
We now have a histogram in the form of a (uniqueBinEntry, wordCounts) array
pair.
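For illustration, a host-side sketch of Algorithm 2 with Thrust is given below; it assumes binEntry has already been filtered and compacted as described in Section 3.1.1, and the function name is ours.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/adjacent_difference.h>
#include <thrust/count.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>

typedef unsigned long long BinEntry;

void count_by_groups(thrust::device_vector<BinEntry>& binEntry,
                     thrust::device_vector<BinEntry>& uniqueBinEntry,
                     thrust::device_vector<int>& wordCounts)
{
    size_t n = binEntry.size();
    if (n == 0) return;

    // Group identical (space bin, time bin, word) combinations together.
    thrust::sort(binEntry.begin(), binEntry.end());

    // adjacent_difference leaves 0 exactly where a sorted entry repeats its
    // predecessor, so the number of unique entries is n minus those zero gaps.
    thrust::device_vector<BinEntry> diff(n);
    thrust::adjacent_difference(binEntry.begin(), binEntry.end(), diff.begin());
    size_t duplicates = thrust::count(diff.begin() + 1, diff.end(), (BinEntry)0);
    size_t V = n - duplicates;

    uniqueBinEntry.resize(V);
    wordCounts.resize(V);

    // Summing a vector of 1's per key yields the per-bin, per-word counts.
    thrust::device_vector<int> ones(n);
    thrust::fill(ones.begin(), ones.end(), 1);
    thrust::reduce_by_key(binEntry.begin(), binEntry.end(), ones.begin(),
                          uniqueBinEntry.begin(), wordCounts.begin());
}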
3.2 Trend Computation
In this section, we present how we detect trends via a parallel implementation of
Welch’s t-test. Also known as an unequal variance t-test, Welch’s t-test is used
to test if two populations have the same mean without assuming that they have the
same variance. It defines the statistic t as follows:
    t = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\frac{s_0^2}{N_0} + \frac{s_1^2}{N_1}}}

where N_i denotes the sample size, \bar{X}_i denotes the sample mean and s_i^2 denotes the
sample variance of sample i.
In our applications, we obtain the two samples by constructing a histogram with
two time bins, namely the before, X0 , and the after, X1 . For each space bin, its
sample Xi for word ID w denotes a set obtained by testing if each word ID from time
bin i is equal to w. Its sample size is the total number of words in time bin i and
that space bin; whereas its sample mean is the number of occurrences of word ID w
in time bin i normalized by its sample size. See Appendix A for the derivation of the
sample means and sample variances.
Given sufficiently large sample sizes (≈ 30), the t-value can be used with the
standard normal distribution to test the null hypothesis that the samples have the
same population means. The larger the t-value, the more evidence suggesting the
after ’s normalized word frequency of word w is larger than the before’s, or simply
put, word w is more likely to be trending in that space bin.
From the histogram, we have the word counts for each bin entry (binX, binY,
binTime, wordId ). However, this test also requires the sample sizes, which are the
total word counts (over all words) for each bin ID (binX, binY, binTime). We obtain
these by summing the individual word counts over the word ID’s for each bin ID, and
write the result to array binWordCounts.
Given these input arrays, we then implement a parallel version of Welch’s t-test
for CUDA (see Algorithm 3). Each GPU thread computes the t-values of multiple
bin entries, and writes the results to array t. This algorithm requires repeated,
uncoalesced access to the binWordCounts array located on the GPU’s global memory.
Our implementation takes advantage of the GPU’s shared memory, which has roughly
100× the bandwidth of its global memory. If the number of (space and time) bins
is small enough, we store a local copy of the binWordCounts array in the shared
memory to eliminate excessive access to the global memory.
As mentioned in Section 3.1.1, we concatenate binY, binX, wordId and binTime to
form a 64-bit bin entry value. Their byte order in a bin entry ensures that a before's
bin entry, if it exists, always lies to the left of its after's counterpart (from the same
space bin, for the same word ID) in the sorted uniqueBinEntry array. In other words,
we can find the before's and after's counterparts quickly in a parallel environment.
Because there is only one t-value per space bin per word ID, array t may contain
interleaving zero’s, which we remove by calling thrust::remove if. This function is
stable, so all t-values maintain their original order in the array.
Algorithm 3 Welch's t-test
Require: uniqueBinEntry, wordCounts, binWordCounts arrays.
 1: function ComputeWelch(threadId)
 2:   allocate array w of size of uniqueBinEntry
 3:   allocate array t of size of uniqueBinEntry and fill with 0's
 4:   allocate array sBinWordCounts of size of binWordCounts in shared memory
 5:   sync threadId < size of binWordCounts
 6:     sBinWordCounts[threadId] ← binWordCounts[threadId]
 7:   end sync
 8:   for all binEntry0, binEntry1 assigned to threadId do
 9:     retrieve binX, binY, binTime, wordId from binEntry0
10:     retrieve binX, binY, binTime, wordId from binEntry1
11:     if binTime1 = 1 then
12:       if binEntry0 and binEntry1 have matching binX, binY, wordId then
13:         wc0 ← wordCounts[binEntry0]
14:       else
15:         wc0 ← 0
16:       end if
17:       wc1 ← wordCounts[binEntry1]
18:       bwc0 ← 1 + sBinWordCounts[binX, binY, binTime = 0]
19:       bwc1 ← 1 + sBinWordCounts[binX, binY, binTime = 1]
20:       w[threadId] ← wordId
21:       t[threadId] ← welch(wc0, wc1, bwc0, bwc1)
22:     end if
23:   end for
24: end function
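The welch() helper on line 21 of Algorithm 3 is not spelled out in the text; the sketch below shows one way to compute it from the four counts, using the sample mean C/N and the indicator-variable variance C(N − C)/(N(N − 1)) derived in Appendix A. Edge cases, such as a bin containing only a single word, are omitted for clarity.

__device__ float welch(float wc0, float wc1, float bwc0, float bwc1)
{
    // Sample means: occurrences of the word normalized by the bin's word total.
    float mean0 = wc0 / bwc0;
    float mean1 = wc1 / bwc1;
    // Sample variances of the 0/1 indicator samples (see Appendix A).
    float var0 = wc0 * (bwc0 - wc0) / (bwc0 * (bwc0 - 1.0f));
    float var1 = wc1 * (bwc1 - wc1) / (bwc1 * (bwc1 - 1.0f));
    // Welch's t statistic: positive when the word is relatively more frequent
    // in the after time bin than in the before time bin.
    return (mean1 - mean0) / sqrtf(var0 / bwc0 + var1 / bwc1);
}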
3.3 Partial Sorting
In the previous step, we obtained t-values that are ordered by word ID for each space
bin. In order to get the top K words that are most likely trending in each space bin,
we need to reorder them by their t-values. While Thrust provides an efficient parallel
sort function for a large array, what we really need is a partial sort function that
sorts hundreds of smaller arrays in parallel and returns the top K elements in each
array in decreasing order. The alternative is to partial-sort these arrays on the CPU,
but that does not make good use of the GPU’s parallel processing power.
In the following section, we describe our parallel partial sort implementation for
GPUs. Its inputs are a keys array, a values array, and an offset array of size J, where
J is the number of sub-arrays that are concatenated to form keys and values. Offset
contains the indexes pointing to the beginning of these sub-arrays. The outputs are
arrays sorted keys and sorted values, both of size J × K.
Because sub-arrays are independent of each other, our strategy is to launch a
grid of thread blocks, each of which is in charge of a (keys, values) sub-array pair.
Threads within each block sort their assigned sub-array pair in parallel, so they
need to cooperate with each other by sharing data and synchronizing. Here is how
the algorithm works. Each block retrieves a block size (the number of threads within
each thread block) of elements from its sub-array pair at a time. Then, it partial-sorts
this block of (keys, values) pairs by their values and stores the top K elements in
a temporary array temp1 in shared memory. The block also maintains another
temporary array temp2 to store the top K elements from previous iterations, and it
updates temp2 by merging the array with temp1. This process is repeated until all
elements in the sub-array have been exhausted. See Algorithm 4.
Algorithm 4 Partial-sorting multiple arrays in parallel
Require: Arrays keys, values and offsets
 1: function PartialSort(blockId, threadId)
 2:   allocate arrays temp1 and temp2 of size K and fill with 0's
 3:   offset ← offsets[blockId]
 4:   size ← offsets[blockId + 1] − offset
 5:   i ← threadId
 6:   while i < size do
 7:     temp1 ← BlockPartialSort()
 8:     merge temp1 into temp2
 9:     i ← i + blockSize
10:   end while
11:   write temp2 to sorted keys and sorted values
12: end function
Until now, we have not explained how a thread block partial-sorts a block of
elements. This algorithm is implemented in two phases, namely a warp-level sort
and a K-pass merge. In the CUDA architecture, instructions are issued per group of 32
threads referred to as a warp. Therefore, threads within a warp, also known as lanes,
are implicitly synchronized. NVIDIA devices with CUDA compute capability 3.0 and
above can execute a shuffle instruction SHFL, which allows a thread to read the register
of another thread within the same warp. This feature allows threads to exchange data
very efficiently at a warp level without the need for any shared memory, which has
a latency significantly higher than a register. We use this technique to implement
a quick warp-level bitonic sort so that each warp on the block ends up containing
a sequence of elements in decreasing order. The results are written to a temporary
array tempBlock.
We further take advantage of the shuffle instruction in our implementation of a
K-pass merge. Unlike the previous phase which uses all threads in a block, the merge
procedure only uses the first warp. In this phase, each lane is responsible for the
32-element sorted sequence generated by a particular warp in the previous phase.
Initially, each lane stores the first element in its sequence—which has the greatest
value—into its register. In each iteration, we perform a warp-level reduction using
SHFL to find the max element among all lanes and save it to the next available slot in
temp1. The lane from which this block-level max originates then reloads its register
with the next element on its sequence. This process is repeated until we have found
the top K elements from the block or there is no data left.
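As an illustration of this warp-level reduction, the sketch below finds the maximum value held across the 32 lanes of a warp using the pre-CUDA-9 __shfl_down intrinsic (available from compute capability 3.0; newer toolkits use __shfl_down_sync). The full merge additionally tracks which lane supplied the maximum, which is omitted here.

__device__ float warp_max(float v)
{
    // Each step halves the number of lanes still holding a candidate maximum.
    for (int offset = 16; offset > 0; offset >>= 1)
    {
        float other = __shfl_down(v, offset);
        v = fmaxf(v, other);
    }
    return v;   // lane 0 now holds the maximum across the warp
}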
Chapter 4
User Interface
In this chapter, we present a visualization interface that enables users to explore and
navigate Twitter events over both space and time in a simple and intuitive fashion.
Specifically, this interface aims to provide users with the following:
• Quick summary: A quick glance of the interface should inform users what, when
and where trending topics are.
• Insightful details: Some clues about why a topic is trending.
• Analysis: Detailed examination of the progression of a trend.
• Interactive user experience: An interface that motivates users to explore and
discover interesting events.
Figure 4-1 outlines the design of our visualization interface, which contains five
primary elements. Users may enter keywords and location in the search engine (see
Figure 4-1.1) to limit the scope of tweets, for example “Obama” near “Boston”. A
timeline chart (Figure 4-1.2) portrays the activity of matching tweets by volume over
time; the more tweets during a period, the higher its value on the chart for that
period. A tweets display (Figure 4-1.3) exhibits a subset of matching tweets, and a
control (Figure 4-1.4) lets users set the spatial granularity of trend detection to small,
medium or large. The centerpiece of the interface is the map (Figure 4-1.5), which
shows the locations of all matching tweets, as well as trending words detected in the
enclosed map region.
Figure 4-2 shows a screenshot of our implementation.
Figure 4-1: Visualization interface design. 1) Search engine to limit the scope of
tweets by keywords and location. 2) A timeline chart shows how the volume of tweets
changes over time. 3) A list showing a subset of matching tweets. 4) Users may
change the spatial granularity of the trend detection algorithm to small, medium or
large. 5) A map showing locations of matching tweets and trending words on Twitter.
Figure 4-2: Visualization interface screenshot.
Behind the scenes, the interface maintains a visualization context that describes
its current state, including the current viewport on the map, search terms and search
time range. User interaction with the interface changes this context, and consequently
triggers one or more elements to reload. In the following sections, we detail the
implementation of these elements and explain how they help achieve the goals we
previously laid out.
4.1 Search Engine
The search engine allows users to specify the terms and/or location which they would
like to explore. Both parameters are optional, and by default the search engine queries
all tweets in the region that is currently displayed on the map. If one or more search
terms are specified, the search engine queries only tweets that contain those search
terms. Users may specify a location in the form of a human-readable address, which
can be a street address, city, state or country.
Once users click “Submit” on the search engine, a new viewport is established
by translating the search location, if specified, using the Google Geocoding API. In
case of an ambiguous address, this API may return multiple results, but only the
first result returned will be used as our new viewport. The search engine then updates
the search terms and, if it changed, the current viewport in the visualization context, causing
the timeline chart, tweets display and map to reload accordingly.
4.2 Map
In order to visually highlight the relationships between tweets from different geographic regions, we introduce an interactive map that is dynamically annotated with
aggregate data from tweets. We are particularly interested in two pieces of information: where Twitter users tweet from and what topics, if any, are showing unusually
high activity across different regions on the map. For these features, we need an
open-source JavaScript mapping library that is flexible and easy to extend. We choose
OpenLayers.
The implementation of the map is an overlay of three layers. The base layer is
a Google Maps tile layer, which contains 256 × 256-pixel images showing political
borders and geographic features such as rivers and roads. To show the geographical
distribution of tweets, we add an image layer in which the background is transparent
and the foreground consists of colored pixels at the locations tweets are sent from, as
depicted in Figure 4-3b. We refer to this layer as the Tweet Locations layer, which is
implemented as a Web Map Service (WMS) layer rendered on MapD server to take
advantage of its parallel processing power. Finally, we add another layer, called the
Geo Trends layer, on top to show trending keywords in different regions, as seen in
Figure 4-3c. Unlike the previous two layers, this interactive layer is generated on the
front end, based on trends data computed on MapD server. The following section
describes how we draw this layer in JavaScript and OpenLayers.
We divide the viewport into M by N invisible, square grid cells of a default size of
128 × 128 pixels, each of which represents a subregion on the map. Each subregion
will end up showing a single “most trending” keyword, if any, so that users can visually
associate each subregion with that keyword and get a quick summary of what topics
are trending across different subregions. To visually group related regions, neighboring cells sharing the same trending keyword are combined into a single polygon
with rectangular edges using Algorithm 5. To highlight the more significant events,
keywords that are trending in multiple subregions have their background polygons
outlined and colored, as shown in Figure 4-4.
Users can zoom or pan the map, causing the base layer to request additional tile
images implicitly. The Tweet Locations layer reloads by updating its WMS request
parameters to reflect the new viewport, and OpenLayers automatically takes care of
the server request and new image placement. To redraw, the Geo Trends layer needs
to recalculate its bounding box based on the new viewport and its grid size before it
makes a request to the MapD server. The reason is as follows.
Figure 4-3: Map is a stack of three layers: (a) Google Maps tile layer, (b) Tweet
Locations layer and (c) Geo Trends layer; (d) shows the overlay of these three layers.

Algorithm 5 Clustering neighboring nodes
Require: nodes is an array of grid cells on the map
 1: function FormClusters(nodes)
 2:   clusters ← empty array
 3:   remaining ← nodes
 4:   while remaining not empty do
 5:     node ← remaining.pop()
 6:     cluster ← [node]
 7:     neighbors ← node.GetNeighbors()
 8:     while neighbors not empty do
 9:       neighbor ← neighbors.pop()
10:       if neighbor in remaining then
11:         remaining.remove(neighbor)
12:         cluster.add(neighbor)
13:         newNeighbors ← neighbor.GetNeighbors()
14:         neighbors.add(newNeighbors)
15:       end if
16:     end while
17:     clusters.add(cluster)
18:   end while
19:   return clusters
20: end function

We design the Geo Trends Layer in such a way that for a given zoom level and
grid size, the corner of each grid cell is always aligned to a map coordinate that is
an integer multiple of the grid size. In other words, each unique grid cell on the
map (as identified by its coordinate) always maps to the same geographical boundary
regardless of the current viewport of the map. Visually, this means that when users
pan the map, the grid cells and their trending keywords, just like the base layer tile
images, follow the movement of the mouse.
In order to help users learn more about trending topics on Twitter, each trending
keyword is clickable; when users click one, that keyword becomes the new search
term of the visualization context. All visualization elements reload to show
only tweets that contain the clicked keyword, and most interestingly, the Geo Trends
layer is now showing other trending keywords that are closely related to the clicked
keyword. Through a series of mouse clicks, users can quickly develop a context about
these topics, and observe patterns in different regions across the map.
Figure 4-4: Geo Trends layer is divided into square grid cells, each of which is assigned
the most trending keyword based on tweets in that cell. Neighboring cells sharing
the same keyword are combined into a single polygon. In this example, keywords such
as "Snow", "Montreal", "Obama", "Romney" and "Beach" appear over their respective regions.
4.3 Tweets Display
In order to provide details about trending events on the map, we display a list of
tweets corresponding to the current visualization context. By using the twitter-text-js
library, we identify hashtags, user handles and links embedded in tweets and autolink
them to Twitter. This feature allows users to find out more about the tweets’ authors
and conversations on Twitter. Also, when users mouse over a tweet in the list, the
map pops up a marker labeled with the author's username at the location of the
tweet.
4.4 Timeline
The timeline display is a chart that illustrates how trending events progress over
time in a clear and meaningful way. The y-axis represents the volume of tweets
containing the search terms in the current visualization context; the x-axis represents
time. A quick glance at the timeline should give users some idea of whether a topic is
currently trending upward, trending downward, or is simply random noise. Trends
recurring on a daily or weekly basis such as “morning”, “coffee” and “TGIF” also
become apparent on the chart.
The timeline also serves as a navigation tool by filtering tweets by time. Here is
how it works. As users zoom or pan the chart, the x-axis (time) scales or translates,
and the chart redraws itself accordingly. The time range of the visualization context
is directly tied to the chart. In other words, the tweets display as well as the Tweets
Location and Geo Trends layers on the map also reload to correspond to the new
time range.
We recognize that tweet activity is an absolute measure and does not offer insights
into the relative significance and timing of a trend. For example, a topic currently
trending at 50,000 tweets/hour does not tell users how the topic ranks against other
topics. To address this need, the timeline display allows users to compare the baseline
search terms against other terms. Figure 4-5 is based on tweets collected when Twitter
was flooded with real-time updates and responses to the Boston Marathon bombing
on April 15, 2013. While both keywords "boston" and "prayforboston" are
trending, "boston" is relatively more popular. Also note how the latter slightly lags
the former in becoming a trending event on Twitter.
Figure 4-5: Timeline display allows users to compare tweet activity of different events.
On April 15, 2013, as two bombs exploded near the Boston Marathon finish line,
“boston” and “prayforboston” became trending topics on Twitter.
Chapter 5
Evaluation
5.1 Experimental Setup
We examine the potential of our implementation for ad hoc trend detection on Twitter
based on two criteria:
• Query time: How long it takes to identify a list of trending topics on Twitter
within a given time range and geographical boundary. For best user experience
on the visualization interface, the MapD server should respond to a trend query
within a minimal amount of time.
• Data size: The number of tweets our trend detection implementation can handle. As Twitter users easily generate millions of tweets each day, our system
should handle as much data as possible without severely sacrificing its performance.
We acquire millions of geo-tagged tweets from the Twitter Streaming API as our
data source and load them into the MapD server. To evaluate the overall system performance, we query trending topics based on different numbers of tweets and measure
the response time of the JSON requests. We further inspect individual components of our
parallel implementation, namely histogram construction and partial sorting. Specifically, we compare their performance over different data sizes against their sequential
counterparts.
We ran our experiments on a 64-bit Ubuntu Linux 12.04 machine equipped with
two 6-core Intel Xeon E5-2620 processors at 2.00 GHz and 64 GB of DDR3 1333 MHz
system memory. We use CUDA 5.0 as well as the Thrust library 1.5.3 that ships with
it. Our GPU is an NVIDIA GTX 680, which has 4 GB of GDDR5 memory with
a bandwidth of 192 GB/s, 8 streaming multiprocessors with 1536 CUDA cores and
CUDA compute capability 3.0.
5.2 Results

5.2.1 System Performance
Figure 5-1 shows how the response time of a trend query scales as we increase the
query size, defined as the number of tweets the query has to read in order to derive
trending keywords. We fixed other query parameters at these values: 10 x-bins, 10
y-bins, and top 10 trends. Our data table contains 12.8 million tweets tweeted from
the Continental United States. We also compare how stop word filtering affects the
performance of the system.
The response time for queries with and without stop word filtering remains relatively
flat at about 100 ms when the query size is smaller than 800,000 tweets. As the
query size increases beyond that, the response time rises rapidly, and the performance
gain from filtering stop words becomes more apparent. In all cases, trend
detection performs better with stop word filtering than without.

We are able to run trend detection on up to 12.8 million tweets in less than 440
ms without the GPU running out of memory when constructing a histogram. The
memory issue is mainly due to the Thrust sorting function which allocates auxiliary
arrays to allow parallel operations, hence requiring additional memory. To alleviate
this limitation, one could explore a streaming algorithm for constructing histograms
or simply use multiple GPU’s.
Figure 5-1: Response time of trend queries (ms) for different query sizes (×100,000
tweets), with and without stop word filtering.
5.2.2 Histogram Construction
To evaluate the performance gain from a GPU implementation, we study how our
histogram construction fares against a COUNT with GROUP BY SQL query. We load
the same table into the MapD server and a PostgreSQL database. This table contains
64 million rows and five columns: row identifier, time, x-coordinate, y-coordinate and
text. Each row in the text column contains only one word that is sampled from the
first word in actual tweets. To speed up the GROUP BY query, we create indexes on
the x, y and time columns of the PostgreSQL table. We also set shared_buffers
to 16 GB, work_mem to 4 GB, and maintenance_work_mem to 8 GB.
On both systems, each query groups the rows into 10 x-bins, 10 y-bins, and 2
time-bins. We disable stop word filtering in our system in order to make both systems
comparable. In addition, to study the performance overhead PostgreSQL suffers from
calculating the bin identifiers sequentially, we compare its performance to a similar
table that has precomputed bin identifiers. For the query plans PostgreSQL uses on
these tables, see Appendix B. To omit outliers and cold start times, we perform 10
successive runs per query size and use the median execution time as our result.
As Figure 5-2 shows, query time rises rapidly on both systems as the query
size increases. As expected, in PostgreSQL, the table consistently performs better
with precomputed bin identifiers than without. At a sufficiently large query size, our
implementation performs significantly better than a COUNT with GROUP BY query
in PostgreSQL, and its magnitude of speedup increases with query size.
Figure 5-2: Histogram construction time on MapD vs. PostgreSQL (with and without
precomputed bin identifiers), and the resulting GPU speedup, for different query sizes
(×100,000 rows).
5.2.3 Partial Sorting
To evaluate our implementation of partial sorting of multiple sub-arrays in parallel,
we compare its performance against a sequential version in CPU. Our data is a large
array containing randomly generated 32-bit keys and another array of the same size
containing 32-bit values. We divide the array pair evenly into many pairs of sub-arrays, and sort each of these pairs by key to return the top K elements. Both the CPU
and GPU versions use the same data set.
Figure 5-3 shows that when the number of sub-arrays is less than 32, the sorting
time of our parallel implementation is relatively flat, and it performs worse than the CPU
version in most cases. This suggests that in these cases, the function does not have
sufficient data to take advantage of parallelism. In addition, as we increase the value
of parameter K, the performance of our parallel implementation degrades; whereas
the CPU version stays the same. On the other hand, when the number of sub-arrays
is large enough and K is relatively small (< 20), as is often the case with our trend
detection algorithm, the GPU version performs better than its CPU counterpart.
Figure 5-3: Performance of partial sorting on GPU vs. CPU for 1,000, 10,000 and
100,000 elements per sub-array, K = 1, 8, 16, and varying numbers of sub-arrays.
Chapter 6
Conclusion
6.1 Contributions
We presented TwitGeo, a system for visualizing trends on Twitter using a fast, GPU-based datastore. On the back end, we introduced a parallel approach to the problem of
detecting trends on Twitter. This approach enables us to identify trending topics from
a large data set on the fly instead of relying on precomputed statistics or predefined
temporal and geospatial indexes. We evaluated how our implementation performs
over different data sizes and compared its performance against a traditional database.
Our parallel implementation of histogram construction resulted in as much as a 300×
speedup in comparison to PostgreSQL, and our system detected trending keywords
from 12.8 million tweets in less than 440 ms.
On the front end, we developed an interface that highlights interesting patterns on
Twitter in both space and time. Our map visualization summarizes what topics are
trending on Twitter over different geographical regions. Our design features elements
that are interactive and encourage users to explore conversations on Twitter.
6.2 Future Work
While our trend detection algorithm shows promising results, more work is needed
before we can deploy this system for real-time trend detection on Twitter. Specifically,
when handling a large data set that does not fit into GPU memory, the datastore
needs a mechanism to load its data and perform computation in a streaming fashion.
Another area worth pursuing is exploring more sophisticated trend detection models and implementing data-parallel variants of them. A simple keyword-based approach has served its purpose of sub-second query latency well, but topic models
such as LDA may offer a more rigorous content analysis on Twitter.
Finally, due to the popularity of smartphones, Twitter users are increasingly sharing photos and videos in their tweets. Adding the functionality of local photo and
video search by users or keywords to our visualization interface may provide another
interesting option to explore events on Twitter.
Appendix A
Trend Detection Statistics
In this section, we outline the derivation of the sample means and sample variances
used in Welch’s t-test statistics. For a given word w, let sample Xi denote a set
obtained by testing if each word from tweets in time period i is equal to word w. For
the same time period, let Ni be the total number of words in tweets and Ci be the
number of occurrences of word w.
    X_i = \{x_{i,0}, x_{i,1}, \ldots, x_{i,N_i}\}

    C_i = \sum_j x_{i,j}

where

    x_{i,j} = \begin{cases} 1 & \text{if } word_j = w \\ 0 & \text{if } word_j \neq w \end{cases}

It follows that the sample mean \bar{X}_i and sample variance s_i^2 are given by:

    \bar{X}_i = \frac{C_i}{N_i}

    s_i^2 = \frac{\sum_j (x_{i,j} - \bar{X}_i)^2}{N_i - 1}
          = \frac{N_i \left( \sum_j x_{i,j}^2 \right) - C_i^2}{N_i (N_i - 1)}
Appendix B
Query Plans
Query plan for the table without precomputed bin identifiers:
EXPLAIN
SELECT SUM(c) FROM
(SELECT binX, binY, binTime, COUNT(*) c FROM
(SELECT
FLOOR( (goog_x - -13888497.96) / (-7450902.94 - -13888497.96) * 10 ) AS binX,
FLOOR( (goog_y - 2817023.96) / (6340356.62 - 2817023.96) * 10 ) AS binY,
FLOOR( CAST((time - 1365998400) AS FLOAT) / (1366084800 - 1365998400) * 2 ) AS binTime
FROM tab4
LIMIT 64000000) tabi
GROUP BY binX, binY, binTime) tabo;
QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=4710289.01..4710289.02 rows=1 width=8)
   ->  HashAggregate  (cost=4566289.01..4630289.01 rows=6400000 width=24)
         ->  Limit  (cost=0.00..3286289.01 rows=64000000 width=20)
               ->  Seq Scan on tab4  (cost=0.00..5352321.40 rows=104235680 width=20)
Query plan for the table with precomputed bin identifiers:
EXPLAIN
SELECT SUM(c) FROM
(SELECT binX, binY, binTime, COUNT(*) c FROM
(SELECT * FROM tab5
LIMIT 64000000) tabi
GROUP BY binX, binY, binTime) tabo;
QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=2567374.50..2567374.51 rows=1 width=8)
   ->  HashAggregate  (cost=2423374.50..2487374.50 rows=6400000 width=24)
         ->  Limit  (cost=0.00..1143374.50 rows=64000000 width=32)
               ->  Seq Scan on tab5  (cost=0.00..1862177.60 rows=104234760 width=32)
Bibliography
[1] P. Bakkum and K. Skadron. Accelerating sql database operations on a gpu with
cuda. In Proceedings of the 3rd Workshop on General-Purpose Computation on
Graphics Processing Units, pages 94–103. ACM, 2010.
[2] J. Benhardus and J. Kalita. Streaming trend detection in twitter. International
Journal of Web Based Communities, 9(1):122–139, 2013.
[3] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi:
interactive topic-based browsing of social status streams. In Proceedings of the
23nd annual ACM symposium on User interface software and technology, pages
303–312. ACM, 2010.
[4] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. Gputerasort: high performance graphics co-processor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD international conference on Management of
data, pages 325–336. ACM, 2006.
[5] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In Proceedings of the
2004 ACM SIGMOD international conference on Management of data, pages
215–226. ACM, 2004.
[6] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander.
Relational query coprocessing on graphics processors. ACM Transactions on
Database Systems (TODS), 34(4):21, 2009.
[7] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious
parallelism for in-memory column-stores. Proceedings of the VLDB Endowment,
6(9), 2013.
[8] T. Kraft, D. X. Wang, J. Delawder, W. Dou, L. Yu, and W. Ribarsky. Less
after-the-fact: Investigative visual analysis of events from streaming twitter.
[9] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C.
Miller. Twitinfo: aggregating and visualizing microblogs for event exploration.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 227–236. ACM, 2011.
[10] M. Mathioudakis and N. Koudas. Twittermonitor: trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference
on Management of data, pages 1155–1158. ACM, 2010.
[11] T. Mostak. TweetMap. http://worldmap.harvard.edu/tweetmap/, 2013.
[12] I. B. Murphy. Fast Database Emerges from MIT Class, GPUs and Students Invention.
http://data-informed.com/fast-database-emerges-from-mit-class-gpus-and-students-invention/,
2013. [Online; accessed 24-Aug-2013].
[13] NVIDIA. NVIDIA Unveils New Flagship GPU For Visual Computing.
http://nvidianews.nvidia.com/Releases/NVIDIA-Unveils-New-Flagship-GPU-for-Visual-Computing-9e3.aspx,
2013. [Online; accessed 24-Aug-2013].
[14] D. Ramage, S. T. Dumais, and D. J. Liebling. Characterizing microblogs with
topic models. In ICWSM, 2010.
[15] X. Wang, W. Dou, Z. Ma, J. Villalobos, Y. Chen, T. Kraft, and W. Ribarsky. Isi: Scalable architecture for analyzing latent topical-level information from social
media data. In Computer Graphics Forum, volume 31, pages 1275–1284. Wiley
Online Library, 2012.
[16] J. Zhang, S. You, and L. Gruenwald. High-performance online spatial and temporal aggregations on multi-core cpus and many-core gpus. In Proceedings of the
fifteenth international workshop on Data warehousing and OLAP, pages 89–96.
ACM, 2012.