Visualizing Trends on Twitter

by Cheng Hau Tong

B.S., Massachusetts Institute of Technology (2008)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, September 2013.

© Massachusetts Institute of Technology 2013. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 26, 2013
Certified by: Samuel Madden, Professor, Thesis Supervisor
Accepted by: Albert R. Meyer, Chairman, Masters of Engineering Thesis Committee

Visualizing Trends on Twitter

by Cheng Hau Tong

Submitted to the Department of Electrical Engineering and Computer Science on August 26, 2013, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

With its growing popularity, Twitter has become an increasingly valuable source of real-time, user-generated information about interesting events in our world. This thesis presents TwitGeo, a system to explore and visualize trending topics on Twitter. It features an interactive map that summarizes trends across different geographical regions. Powered by a novel GPU-based datastore, the system performs ad hoc trend detection without predefined temporal or geospatial indexes, and is capable of discovering trends with arbitrary granularity in both dimensions. An evaluation of the system shows promising results for visualizing trends on Twitter in real time.

Thesis Supervisor: Samuel Madden
Title: Professor

Acknowledgments

I would like to thank Professor Samuel Madden for his invaluable ideas, support and guidance throughout the course of this research. He has been a great mentor and I am greatly indebted to him. I would also like to acknowledge Todd Mostak, the creator of MapD, for the opportunity to work with him on this project. I thank him for his inspiration and for showing me the wonders of GPU computing. I also thank Adam Marcus for his incredible help, advice and the brainstorming sessions we had. Finally, I would like to thank my family and friends for their support over the years.

Contents

1 Introduction
2 Background
  2.1 MapD
  2.2 TweetMap
  2.3 Related Work
    2.3.1 Database Operations on GPU
    2.3.2 Trend Detection on Twitter
3 Trend Detection
  3.1 Histogram
    3.1.1 Preprocessing
    3.1.2 COUNT with GROUP BY
  3.2 Trend Computation
  3.3 Partial Sorting
4 User Interface
  4.1 Search Engine
  4.2 Map
  4.3 Tweets Display
  4.4 Timeline
5 Evaluation
  5.1 Experimental Setup
  5.2 Results
    5.2.1 System Performance
    5.2.2 Histogram Construction
    5.2.3 Partial Sorting
6 Conclusion
  6.1 Contributions
  6.2 Future Work
A Trend Detection Statistics
B Query Plans

List of Figures

3-1 Trend detection inputs.
4-1 Visualization interface design.
4-2 Visualization interface screenshot.
4-3 Map layers.
4-4 Geo Trends layer.
4-5 Timeline display.
5-1 Response time of trend query.
5-2 Histogram construction time.
5-3 Performance of partial sorting.

Chapter 1

Introduction

Twitter is a social media platform that allows users to share topics and conversations in short text messages limited to 140 characters. Since its launch in 2006, Twitter has become a valuable source of real-time, user-generated content about interesting world and local events. In fact, mainstream media today increasingly uses Twitter to source stories for major news events.

When certain topics become popular on Twitter, users are often interested in the context of these events. In particular, they want to know where and when these events took place, and how the stories progress over time. Even more so, users are curious about the opinions of the public or of popular figures towards these stories, and how the responses might differ in their local neighborhood versus other locations. However, identifying emerging topics from a large collection of documents and turning this data into meaningful insights is a challenging problem.

A prerequisite for detecting events in a temporal stream is to first obtain a representation of its activity over a sequence of evenly spaced time intervals, otherwise known as a time series. The procedure of grouping data points from successive time intervals, usually accomplished by a COUNT with GROUP BY query on a database, can be time-consuming and does not scale well as the data set grows larger.

In order to work around this limitation, some systems resort to precomputing the time series instead of calculating them on the fly. However, this approach is inflexible and imposes additional constraints on the analytics that can be derived from the time series. For example, suppose we precompute a time series representing the hourly count of tweets originating from the United States.
From this data, we cannot tell what percentage of these tweets comes from the Boston area, or contains the word “Obama”, or is written by user John Doe. Our goal is to visualize trending topics on Twitter, which requires sub-second latency for best user experience. However, this is not the performance most relational systems are engineered to provide, especially over large data sets. In this thesis, we explore a different approach to tackle the problem of trend detection on a large data set. Instead of relying on precomputed statistics, we take advantage of a graphics processing unit (GPU) to speed up data retrieval and computations. In particular, its high memory bandwidth reduces latency waiting for new data and its parallel processing power allows for computations on more data at the same time. As we perform all calculations as needed, this approach also enables us to detect trends on an arbitrary subset of the original data, and not be limited by any predefined parameters. In this thesis, we present TwitGeo, a system for visualizing trends on Twitter. Our contribution is a parallel trend detection algorithm using MapD [12], a fast, GPU-based datastore as the back end. Specifically, we decompose the problem into three phases—grouping, statistics computation and ranking, and present a parallel implementation for each phase. We analyze the performance of our implementation against a traditional database, and discuss cases where one outperforms the other. As an application of our parallel back end, we describe a visualization interface we have built that displays trending keywords on Twitter. We demonstrate how our approach allows for trend detection on tweets with arbitrary temporal and spatial granularity in absence of any indexes on these columns. We begin in Chapter 2 with previous work on database operations on GPU and trend detection on Twitter. In this chapter, we also provide an overview of MapD. In Chapter 3, we describe our parallel implementations of the three phases in trend detection. Chapter 4 presents our visualization interface that allows users to see trend patterns in both space and time. Chapter 5 describes how we evaluate the 12 performance of our system and discusses the results. Chapter 6 concludes this thesis and outlines directions for future work. 13 14 Chapter 2 Background In this chapter, we begin with a brief overview of our novel database back end in Section 2.1 and its existing visualization interface for Twitter in Section 2.2. We discuss related work in database implementations on GPU and trend detection on Twitter in Section 2.3. 2.1 MapD The Massively Parallel Database (MapD) is a novel parallel datastore created by Todd Mostak, a fellow researcher at MIT. Built on top of NVIDIA CUDA platform, the system uses GPU’s massively parallel processing power and high memory bandwidth to speed up common database operations, resulting in as much as a 70× speedup in comparison to CPU-based systems [12]. The architecture of MapD is designed to optimize query performance by using parallelism efficiently. While MapD stores its data on disk in memory-mapped files, it caches as much data as possible in the GPU memory. This eliminates unnecessary data transfer from the system to the GPU because such transfers provide lower bandwidth than memory accesses within the GPU. In most queries, a database performs the same computation on items from the same column, and these operations can be carried out independently. 
By storing its data in columnar format, MapD maximizes data parallelism and allows the GPU to compute on many items at the same time. In 15 addition, it improves memory performance due to the coalesced pattern of its memory access. MapD has great query performance, often taking just milliseconds to access millions of rows, especially with aggregation queries. This performance makes it a great tool for computing analytics that might otherwise take hours on CPU. To speed up operations on text based columns, it splits text into words and translates them into unique integer identifiers known as word ID’s. The additional benefit of doing so is that this representation has a smaller memory footprint and allows for more data to be stored in the GPU. The system features a SQL interface, which allows other researchers to access the database with little learning curve. To facilitate data migration, MapD supports data import from other databases including MySQL and PostgreSQL. In addition, MapD also provides some functionality specific to geospatial applications, such as heat map and geolocation map rendering. However, as a new database still in development, MapD has its own limitations. The system cannot handle data that is not cached in the GPU memory, and requires explicit instruction from users to load the data. Because of this, its working data size is limited by the amount of memory available on the GPU’s, which makes it less useful for big data applications. Presently, the largest memory available on a single top notch GPU is 12 GB [13]. This thesis extends the functionality of MapD by providing a parallel implementation of grouping in GPU. 2.2 TweetMap Another visualization powered by MapD, TweetMap [11] is the predecessor of our interface. Both interfaces are geospatial centric and aim to highlight spatial patterns observed from tweets. While TweetMap is a great tool to visualize the spatial distribution of tweets, it provides limited context and insight into the content of the tweets. Our interface addresses these needs by showing users the trending keywords 16 over different geographical regions on map. In addition, we introduce new elements to our visualization to be more interactive and engaging, as detailed in Chapter 4. 2.3 2.3.1 Related Work Database Operations on GPU Here we give an overview of previous work on database implementation on GPU. Prior to the availability of General-Purpose Graphics Processing Unit (GPGPU) programming languages, researchers exploited direct access to texture memory on GPU to perform vector computations. Govindaraju et al. presented algorithms for common database operations including predicates, boolean combinations, and aggregations [5], and later on sorting [4], all implemented in OpenGL API. Still at their early stage, some database operations on GPU were already seeing performance gain over CPU implementation. The introduction of NVIDIA CUDA platform in 2007 provides developers a programming interface to realize the massively parallel processing power of GPU for database operations. He et al presented an in-memory relational database, GDB with supports for selection, projection, ordering, joins, and grouping and aggregation [6]. Bakkum and Skadron observed a performance gain upwards of 20× on their GPU implementation of a SQLite database against its CPU counterpart, but omitted support for queries with grouping [1]. The most closely related work is the study by Zhang et al. 
to aggregate spatial and temporal data from taxi trips in New York City on GPU [16]. The authors presented a parallel grouping algorithm implemented by using sort and reduce by key data parallel primitives from the Thrust library. In order to count taxi pickup records by street and hour, they combined a street identifier vector with a pickup hour vector to form a key vector, sorted the key vector, and counted the occurrence of each unique key in the vector. We take a similar approach in Section 3.1 to construct our histogram for trend detection. 17 While CUDA is becoming stable and mature, recent work in the field is showing increasing interest in device agnostic parallel programming platform. Heimel et al. presented Ocelot, an extension of the open-source columnar database MonetDB [7]. Unlike MapD, Ocelot is implemented in OpenCL and can take advantage of parallelism in both multi-core CPU and GPU. The authors implemented common database operations mostly based on existing work from the area, and also introduced a hash table implementation for their grouping operator. 2.3.2 Trend Detection on Twitter There are numerous tools that help users explore interesting topics on Twitter, but a majority of them do not offer users the flexibility to define the scope of data on the fly. On one hand, some works on prefiltered data set: Eddi [3] uses a user’s feed; TwitInfo [9] requires users to predefine a set of keywords. On the other hand, GTAC [8] detects geo-temporal events from around the world. Our approach is to combine the best of both worlds by allowing the users to be as broad or specific as they wish. This requires a fast trend detection algorithm capable of handling a large collection of unprocessed tweets. While researchers use different definitions of trends on Twitter, we can generalize their models into keyword-based versus topic-based. Benhardus [2] used a keywordbased approach to identify trends from n-grams; whereas Mathioudakis and Koudas [10] defined topics as groups of bursty keywords with strong co-occurrences. Ramage et al. [14], Wang et al. [15], and Lau et al. all used variants of Latent Dirichlet allocation (LDA), a common topic model to discover the underlying topics in tweets. While a topic model might offer better contextual insights, we choose the keywordbased approach for its simplicity and apparent data parallelism. 18 Chapter 3 Trend Detection In this chapter, we present a parallel implementation of trend detection on Twitter. Before diving into the details, we first define the problem. The input given is a list of tweets within a given time range and a given map viewport. Each tweet contains a timestamp, a coordinate and some text. We then divide the viewport evenly into M ×N square grid cells. For each grid cell, return top K words that are most trending among all tweets contained within that cell. We broadly define a word as trending when its mean normalized word frequency is significantly higher in the more recent windows of time compared to earlier windows. At a high level, we can decompose this problem into several pieces. First, for each unique word that occurs in the input tweets, we need to understand its spatial and temporal distribution, that is we need to obtain a 3-D histogram—two dimensions for space and one for time. On a database, this can be accomplished by a SQL query below, where binX, binY and binTime denote identifiers to M bins in x-coordinates, N bins in y-coordinates and L bins in time, respectively. 
SELECT binX, binY, binTime, word, COUNT(*)
FROM tweets
GROUP BY binX, binY, binTime, word;

Given this histogram, we then compute statistics that will allow us to detect trends observed in the time-series histogram for each word in each space bin. Finally, we need to sort the words in each space bin by their previously computed trending statistics. In Sections 3.1–3.3, we explore how these subproblems can be written to take advantage of a GPU's massively parallel processing power as well as its high memory bandwidth.

3.1 Histogram

Let us further examine the inputs of our trend detection algorithm. Let T denote the number of input tweets, and U the total number of words in those tweets. The actual inputs are arrays representing fields in tweets, as shown in Figure 3-1. The timestamp, x-coordinate and y-coordinate arrays each have cardinality T. The MapD server represents the text in a tweet as a variable-length array of 32-bit integers known as word identifiers (word ID's), each of which uniquely identifies a word. Therefore, the server returns the text of the tweets in the form of a word array of size U containing the word ID's. To retrieve the word ID's, we are provided with an offset array of size T which contains indexes into the word array indicating the first word ID of each of the T tweets.

Figure 3-1: Trend detection inputs consist of five arrays: timestamp, x-coordinate, y-coordinate, offset and word. The array offset contains indexes into word indicating the first word ID of each of the tweets.

In the sections that follow, we describe how we preprocess the input arrays and then construct a histogram from them.

3.1.1 Preprocessing

Before we calculate the histogram, we filter out stop words such as "I", "am", "going", etc., which do not add much context to a sentence and should not be considered trending words. This step improves the performance of histogram construction by reducing the data size significantly. We also annotate each word occurrence (i.e., each element in array word) with its x, y and time-bin identifiers (bin ID's), based on the number of space and time bins, the bounding box of the viewport, and the time range of the tweets. In general, given a number of bins M and a value range [X_0, X_1], a bin ID (zero-based) is calculated as

binX = \left\lfloor \frac{x - X_0}{X_1 - X_0} \times M \right\rfloor

Algorithm 1 details how we perform these preprocessing steps in parallel. We first load the word ID's of the stop words into a bitarray, stopWords. We also allocate an array binEntry of size U to store the bin entries. We define a bin entry as a 64-bit integer that describes an occurrence of a word by its x, y and time-bin ID's, and its word ID. Its value is given by

binEntry = (binY << 56) | (binX << 48) | (wordId << 16) | binTime

that is, 8 bits are allocated for the y-bin ID, 8 bits for the x-bin ID, 32 bits for the word ID, and 16 bits for the time-bin ID. We will see in Section 3.2 what role this memory layout plays in the implementation of our trend detection function. Continuing with Algorithm 1, in parallel, each GPU thread preprocesses a single tweet. For each word in the underlying tweet that is not a stop word, the thread writes the value given above to binEntry, or 0 otherwise.
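To make the binning and bit packing concrete before the pseudocode in Algorithm 1 below, the per-tweet kernel can be sketched in CUDA roughly as follows. This is an illustrative sketch rather than the thesis's actual implementation: the helper computeBin, the byte-array form of stopWords, and the sentinel entry at the end of offset (implied by Algorithm 1's use of offset[threadId + 1]) are assumptions of ours.

```cuda
#include <cstdint>

// Illustrative sketch only; names and signatures are hypothetical.
__device__ inline uint64_t computeBin(float v, float lo, float hi, int numBins) {
    // Zero-based bin ID: floor((v - lo) / (hi - lo) * numBins), clamped to a valid bin.
    int bin = static_cast<int>((v - lo) / (hi - lo) * numBins);
    return static_cast<uint64_t>(min(max(bin, 0), numBins - 1));
}

// One thread per tweet: annotate every non-stop-word occurrence with its bin entry.
__global__ void createBinEntries(const float* x, const float* y, const float* time,
                                 const uint32_t* word, const uint32_t* offset,
                                 const uint8_t* stopWords,  // 1 if the word ID is a stop word
                                 uint64_t* binEntry, int numTweets,
                                 float x0, float x1, float y0, float y1,
                                 float t0, float t1, int M, int N, int L) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numTweets) return;

    uint64_t binX = computeBin(x[i], x0, x1, M);
    uint64_t binY = computeBin(y[i], y0, y1, N);
    uint64_t binT = computeBin(time[i], t0, t1, L);
    // 8 bits binY | 8 bits binX | 32 bits wordId | 16 bits binTime.
    uint64_t base = (binY << 56) | (binX << 48) | binT;

    // offset is assumed to carry one extra sentinel entry so that offset[i + 1]
    // is valid for the last tweet, mirroring Algorithm 1's offset[threadId + 1].
    for (uint32_t j = offset[i]; j < offset[i + 1]; ++j) {
        uint32_t wordId = word[j];
        binEntry[j] = stopWords[wordId] ? 0ULL
                                        : (base | (static_cast<uint64_t>(wordId) << 16));
    }
}
```

Because each thread writes only to the word positions of its own tweet, no synchronization is needed; the subsequent compaction and grouping steps then operate on the binEntry array as a whole.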
Algorithm 1 Creating bin entries and filtering stop words
Require: Array binEntry of size U
1: function CreateBinEntries(x, y, time, word, threadId)
2:   calculate binX from x, binY from y, and binTime from time
3:   base ← (binY << 56) | (binX << 48) | binTime
4:   location ← offset[threadId]
5:   end ← offset[threadId + 1]
6:   while location ≠ end do
7:     wordId ← word[location]
8:     if wordId ∉ stopWords then
9:       binEntry[location] ← base | (wordId << 16)
10:    else
11:      binEntry[location] ← 0
12:    end if
13:    location ← location + 1
14:  end while
15: end function

After binEntry is populated, we move all non-null bin entries to the head of the array, and the rest to the tail, by calling the parallel thrust::remove_if function from the Thrust library, a parallel algorithms library that resembles the C++ Standard Template Library (STL). This function returns an iterator marking the end of the range of non-zero bin entries.

3.1.2 COUNT with GROUP BY

Now that each occurrence of a word is annotated with its bin ID's in the form of a bin entry, we can implement a COUNT with GROUP BY function in a few steps, outlined in Algorithm 2, that take advantage of the rich collection of data parallel primitives provided by the Thrust library. Each occurrence of the same word in the same bin has the same binEntry value. To group these occurrences together, we sort the binEntry array in parallel by calling thrust::sort. We write each unique bin entry and the number of times it occurs into the arrays uniqueBinEntry and wordCounts, respectively. However, before we can allocate these arrays, we need to find their size by counting the unique values in binEntry. This can be accomplished by calling the thrust::adjacent_difference and thrust::count parallel functions. We also allocate an auxiliary array ones of size U and fill it with 1's. Finally, we count the number of occurrences of each unique bin entry value. We do so by calling thrust::reduce_by_key, which sums consecutive ones values sharing the same binEntry value in parallel, and then writes each unique bin entry and its sum of ones to uniqueBinEntry and wordCounts, respectively.

Algorithm 2 Counting word occurrences by bins and words
Require: Array binEntry
1: function CountByGroups
2:   sort binEntry
3:   V ← number of unique values in binEntry
4:   allocate array ones of size U and fill with 1's
5:   allocate arrays uniqueBinEntry and wordCounts of size V
6:   (uniqueBinEntry, wordCounts) ← reduce consecutive ones elements by binEntry key
7: end function

We now have a histogram in the form of a (uniqueBinEntry, wordCounts) array pair.

3.2 Trend Computation

In this section, we present how we detect trends via a parallel implementation of Welch's t-test. Also known as the unequal variance t-test, Welch's t-test is used to test whether two populations have the same mean without assuming that they have the same variance. It defines the statistic t as follows:

t = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_0^2}{N_0}}}

where N_i denotes the sample size, \bar{X}_i the sample mean and s_i^2 the sample variance of sample i. In our application, we obtain the two samples by constructing a histogram with two time bins, namely the before, X_0, and the after, X_1. For each space bin, its sample X_i for word ID w denotes the set obtained by testing whether each word ID from time bin i is equal to w. Its sample size is the total number of words in time bin i and that space bin, whereas its sample mean is the number of occurrences of word ID w in time bin i normalized by its sample size. See Appendix A for the derivation of the sample means and sample variances.
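Concretely, once the two word counts and the two bin-wide totals for a word in a space bin are known, the t statistic reduces to a few arithmetic operations. The helper below is an illustrative reconstruction of the welch(wc0, wc1, bwc0, bwc1) call used by Algorithm 3 later in this section, written from the closed forms derived in Appendix A; it is not the thesis's actual code.

```cpp
#include <cmath>

// Welch's t-statistic for one word in one space bin (illustrative sketch).
//   wc0, wc1   : occurrences of the word in the before/after time bins (C_i in Appendix A)
//   bwc0, bwc1 : total word counts of those bins (N_i); Algorithm 3 passes 1 + the bin total
// Callers should still guard degenerate bins (N_i < 2), where the variances are undefined.
// Mark the function __device__ to call it from a CUDA kernel such as Algorithm 3.
inline float welch(float wc0, float wc1, float bwc0, float bwc1) {
    float mean0 = wc0 / bwc0;                                        // C_0 / N_0
    float mean1 = wc1 / bwc1;                                        // C_1 / N_1
    float var0 = (bwc0 * wc0 - wc0 * wc0) / (bwc0 * (bwc0 - 1.0f));  // s_0^2
    float var1 = (bwc1 * wc1 - wc1 * wc1) / (bwc1 * (bwc1 - 1.0f));  // s_1^2
    return (mean1 - mean0) / std::sqrt(var0 / bwc0 + var1 / bwc1);
}
```

Since the samples are 0/1 indicators, the sum of squares equals the count itself, which is why the variance can be written directly in terms of wc and bwc without touching the individual observations.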
Given sufficiently large sample sizes (≈ 30), the t-value can be used with the standard normal distribution to test the null hypothesis that the samples have the same population means. The larger the t-value, the more evidence suggesting the after ’s normalized word frequency of word w is larger than the before’s, or simply put, word w is more likely to be trending in that space bin. From the histogram, we have the word counts for each bin entry (binX, binY, binTime, wordId ). However, this test also requires the sample sizes, which are the total words counts (over all words) for each bin ID (binX, binY, binTime). We obtain these by summing the individual word counts over the word ID’s for each bin ID, and write the result to array binWordCounts. Given these input arrays, we then implement a parallel version of Welch’s t-test for CUDA (see Algorithm 3). Each GPU thread computes the t-values of multiple bin entries, and writes the results to array t. This algorithm requires repeated, uncoalesced access to the binWordCounts array located on the GPU’s global memory. Our implementation takes advantage of the GPU’s shared memory, which has roughly 100× the bandwidth of its global memory. If the number of (space and time) bins is small enough, we store a local copy of the binWordCounts array in the shared memory to eliminate excessive access to the global memory. As mentioned in section 3.1.1, we concatenate binY, binX, wordId and binTime to form a 64-bit bin entry value. Their byte order in a bin entry ensures that a before’s bin entry, if exists, always locates on the left of its after ’s counterparts (from the space bin, of the same word ID) in the sorted uniqueBinEntry array. In other words, we can find the before’s and after ’s counterparts quickly in a parallel environment. Because there is only one t-value per space bin per word ID, array t may contain interleaving zero’s, which we remove by calling thrust::remove if. This function is stable, so all t-values maintain their original order in the array. 24 Algorithm 3 Welch’s t-test Require: uniqueBinEntry, wordCounts, binWordCounts arrays. 1: function ComputeWelch(threadId ) 2: allocate array w of size of uniqueBinEntry 3: allocate array t of size of uniqueBinEntry and fill with 0’s 4: allocate array sBinWordCounts of size of binWordCounts in shared memory 5: sync threadId < size of binWordCounts 6: sBinWordCounts[threadId] ← binWordCounts[threadId] 7: end sync 8: for all binEntry0, binEntry1 assigned to threadId do 9: retrieve binX, binY, binTime, wordId from binEntry0 10: retrieve binX, binY, binTime, wordId from binEntry1 11: if binTime1 = 1 then 12: if binEntry0 and binEntry1 have matching binX, binY, wordId then 13: wc0 ← wordCounts[binEntry0] 14: else 15: wc0 ← 0 16: end if 17: wc1 ← wordCounts[binEntry1] 18: bwc0 ← 1 + sBinWordCounts[binX, binY,binTime = 0] 19: bwc1 ← 1 + sBinWordCounts[binX, binY,binTime = 1] 20: w[threadId] ← wordId 21: t[threadId] ← welch(wc0, wc1, bwc0, bwc1) 22: end if 23: end for 24: end function 3.3 Partial Sorting In the previous step, we obtained t-values that are ordered by word ID for each space bin. In order to get the top K words that are most likely trending in each space bin, we need to reorder them by their t-values. While Thrust provides an efficient parallel sort function for a large array, what we really need is a partial sort function that sorts hundreds of smaller arrays in parallel and returns the top K elements in each array in decreasing order. 
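For reference, the straightforward sequential formulation of this operation, using the keys, values and offsets layout introduced below, is a per-sub-array std::partial_sort. The sketch is an illustrative reconstruction (of the kind of CPU baseline Section 5.2.3 compares against), not the thesis's benchmark code.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// For each sub-array [offsets[j], offsets[j+1]), keep the top-K (key, value) pairs
// in decreasing order of value. offsets is assumed to carry a final sentinel entry.
void partialSortCpu(const std::vector<uint32_t>& keys, const std::vector<float>& values,
                    const std::vector<size_t>& offsets, size_t K,
                    std::vector<uint32_t>& sortedKeys, std::vector<float>& sortedValues) {
    const size_t J = offsets.size() - 1;  // number of sub-arrays
    sortedKeys.assign(J * K, 0);
    sortedValues.assign(J * K, 0.0f);

    std::vector<size_t> idx;
    for (size_t j = 0; j < J; ++j) {
        idx.resize(offsets[j + 1] - offsets[j]);
        std::iota(idx.begin(), idx.end(), offsets[j]);

        const size_t k = std::min(K, idx.size());
        // Order the first k indexes by decreasing value; the remainder is left unordered.
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                          [&](size_t a, size_t b) { return values[a] > values[b]; });
        for (size_t r = 0; r < k; ++r) {
            sortedKeys[j * K + r] = keys[idx[r]];
            sortedValues[j * K + r] = values[idx[r]];
        }
    }
}
```

Note that this loop visits one sub-array at a time on a single core, which is exactly the work pattern the GPU implementation described next tries to spread across thread blocks.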
The alternative is to partial-sort these arrays on the CPU, but that does not make good use of the GPU’s parallel processing power. In the following section, we describe our parallel partial sort implementation for GPUs. Its inputs are a keys array, a values array, and an offset array of size J, where J is the number of sub-arrays that are concatenated to form keys and values. Offset 25 contains the indexes pointing to the beginning of these sub-arrays. The outputs are arrays sorted keys and sorted values, both of size J × K. Because sub-arrays are independent of each other, our strategy is to launch a grid of thread blocks, each of which in charge of a (keys, values) sub-array pair. Threads within each block sort their assigned sub-array pairs in parallel, so they need to cooperate with each other by sharing data and synchronizing. Here is how the algorithm works. Each block retrieves from its sub-array pair a block size—the number of threads within each thread block—of elements at a time. Then, it partialsorts this block of (keys, values) pairs by their value and stores the top K elements to a temporary array temp1 on the shared memory. The block also maintains another temporary array temp2 to store the top K elements from previous iterations, and it updates temp2 by merging the array with temp1. This process is repeated until all elements in the sub-array have been exhausted. See Algorithm 4. Algorithm 4 Partial-sorting multiple arrays in parallel Require: Arrays keys, values and offsets 1: function PartialSort(blockId, threadId ) 2: allocate arrays temp1 and temp2 of size K and fill with 0’s 3: offset ← offsets[blockId] 4: size ← offsets[blockId + 1] − offset 5: i ← threadId 6: while i < size do 7: temp1 ← BlockPartialSort() 8: merge temp1 to temp2 9: i ← i + blockSize 10: end while 11: write temp2 to sorted keys and sorted values 12: end function Until now, we have not explained how a thread block partial-sorts a block of elements. This algorithm is implemented in two phases, namely a warp-level sort and a K-pass merge. In CUDA architecture, instructions are issued per group of 32 threads referred to as a warp. Therefore, threads within a warp, also known as lanes, are implicitly synchronized. NVIDIA devices with CUDA compute capability 3.0 and above can execute a shuffle instruction SHFL, which allows a thread to read the register 26 of another thread within the same warp. This feature allows threads to exchange data very efficiently at a warp level without the need for any shared memory, which has a latency significantly higher than a register. We use this technique to implement a quick warp-level bitonic sort so that each warp on the block ends up containing a sequence of elements in decreasing order. The results are written to a temporary array tempBlock. We further take advantage of the shuffle instruction in our implementation of a K-pass merge. Unlike the previous phase which uses all threads in a block, the merge procedure only uses the first warp. In this phase, each lane is responsible for the 32-element sorted sequence generated by a particular wrap from the previous phase. Initially, each lane stores the first element in its sequence—which has the greatest value—into its register. In each iteration, we perform a warp-level reduction using SHFL to find the max element among all lanes and save it to the next available slot in temp1. The lane from which this block-level max originates then reloads its register with the next element on its sequence. 
This process is repeated until we have found the top K elements from the block or there is no data left. 27 28 Chapter 4 User Interface In this chapter, we present a visualization interface that enables users to explore and navigate Twitter events over both dimensions in a simplistic and intuitive fashion. Specifically, this interface aims to provide users with the following: • Quick summary: A quick glance of the interface should inform users what, when and where trending topics are. • Insightful details: Some clues about why a topic is trending. • Analysis: Detailed examination on progression of a trend. • Interactive user experience: An interface that motivates users to explore and discover interesting events. Figure 4-1 outlines the design of our visualization interface, which contains five primary elements. Users may enter keywords and location in the search engine (see Figure 4-1.1) to limit the scope of tweets, for example “Obama” near “Boston”. A timeline chart (Figure 4-1.2) portrays the activity of matching tweets by volume over time; the more tweets during a period, the higher its value on the chart for that period. A tweets display (Figure 4-1.3) exhibits a subset of matching tweets. The centerpiece of the interface is the map (Figure 4-1.5), which shows the locations of all matching tweets, as well as trending words detected in the enclosed map region. Figure 4-2 shows a screenshot of our implementation. 29 Figure 4-1: Visualization interface design. 1) Search engine to limit the scope of tweets by keywords and location. 2) A timeline chart shows how the volume of tweets changes over time. 3) A list showing a subset of matching tweets. 4) Users may change the spatial granularity of the trend detection algorithm to small, medium or large. 5) A map showing locations of matching tweets and trending words on Twitter. Figure 4-2: Visualization interface screenshot. 30 Behind the scenes, the interface maintains a visualization context that describes its current state, including the current viewport on the map, search terms and search time range. User interaction with the interface changes this context, and consequently triggers one or more elements to reload. In the following sections, we detail the implementation of these elements and explain how they help achieve the goals we previously laid out. 4.1 Search Engine The search engine allows users to specify the terms and/or location which they would like to explore. Both parameters are optional, and by default the search engine queries all tweets in the region that is currently displayed on the map. If one or more search terms are specified, the search engine queries only tweets that contain those search terms. Users may specify a location in the form of a human-readable address, which can be a street address, city, state or country. Once users click “Submit” on the search engine, a new viewport is established by translating the search location, if specified, using the Google Geocoding API. In case of an ambiguous address, this API may return multiple results, but only the first result returned will be used as our new viewport. The search engine changes the search terms and current viewport (if changed) in the visualization context, causing the timeline chart, tweets display and map to reload accordingly. 4.2 Map In order to visually highlight the relationships between tweets from different geographic regions, we introduce an interactive map that is dynamically annotated with aggregate data from tweets. 
We are particularly interested in two pieces of information: where Twitter users tweet from and what topics, if any, are showing unusually high activity across different regions on the map. For those features, we need an open source JavaScript mapping library that is flexible and easy to extend. We choose OpenLayers.

The implementation of the map is an overlay of three layers. The base layer is a Google Maps tile layer, which contains 256 × 256-pixel images showing political borders and geographic features such as rivers and roads. To show the geographical distribution of tweets, we add an image layer in which the background is transparent and the foreground consists of colored pixels at the locations tweets are sent from, as depicted in Figure 4-3b. We refer to this layer as the Tweet Locations layer; it is implemented as a Web Map Service (WMS) layer rendered on the MapD server to take advantage of its parallel processing power. Finally, we add another layer, called the Geo Trends layer, on top to show trending keywords in different regions, as seen in Figure 4-3c. Unlike the previous two layers, this interactive layer is generated on the front end, based on trends data computed on the MapD server. The following section describes how we draw this layer in JavaScript and OpenLayers.

Figure 4-3: Map is a stack of three layers: (a) Google Maps tile layer, (b) Tweet Locations layer and (c) Geo Trends layer. (d) shows an overlay of these three layers.

We divide the viewport into M by N invisible, square grid cells of a default size of 128 × 128 pixels, each of which represents a subregion on the map. Each subregion will end up showing a single "most trending" keyword, if any, so that users can visually associate each subregion with that keyword and get a quick summary of what topics are trending across different subregions. To visually group related regions, neighboring cells sharing the same trending keyword are combined into a single polygon with rectangular edges using Algorithm 5. To highlight the more significant events, keywords that are trending in multiple subregions have their background polygons outlined and colored, as shown in Figure 4-4.

Algorithm 5 Clustering neighboring nodes
Require: nodes is an array of grid cells on the map
1: function FormClusters(nodes)
2:   clusters ← empty array
3:   remaining ← nodes
4:   while remaining not empty do
5:     node ← remove first element of remaining
6:     cluster ← [node]
7:     neighbors ← node.GetNeighbors()
8:     while neighbors not empty do
9:       neighbor ← remove first element of neighbors
10:      if neighbor in remaining then
11:        remaining.remove(neighbor)
12:        cluster.add(neighbor)
13:        newNeighbors ← neighbor.GetNeighbors()
14:        neighbors.add(newNeighbors)
15:      end if
16:    end while
17:    clusters.add(cluster)
18:  end while
19:  return clusters
20: end function

Figure 4-4: Geo Trends layer is divided into square grid cells, each of which is assigned the most trending keyword based on tweets in that cell (for example, "Snow", "Montreal", "Obama", "Romney" and "Beach"). Neighboring cells sharing the same keyword are combined into a single polygon.

Users can zoom or pan the map, causing the base layer to request additional tile images implicitly. The Tweet Locations layer reloads by updating its WMS request parameters to reflect the new viewport, and OpenLayers automatically takes care of the server request and new image placement. To redraw, the Geo Trends layer needs to recalculate its bounding box based on the new viewport and its grid size before it makes a request to the MapD server. Here is the reason for this requirement. We design the Geo Trends layer in such a way that, for a given zoom level and grid size, the corner of each grid cell is always aligned to a map coordinate that is an integer multiple of the grid size. In other words, each unique grid cell on the map (as identified by its coordinate) always maps to the same geographical boundary regardless of the current viewport of the map. Visually, this means that when users pan the map, the grid cells and their trending keywords, just like the base layer tile images, follow the movement of the mouse.

In order to help users learn more about trending topics on Twitter, each trending keyword is clickable; when users click one, that keyword becomes the new search terms of the visualization context. All visualization elements reload to show only tweets that contain the clicked keyword, and most interestingly, the Geo Trends layer now shows other trending keywords that are closely related to the clicked keyword. Through a series of mouse clicks, users can quickly develop a context about these topics and observe patterns in different regions across the map.

4.3 Tweets Display

In order to provide details about trending events on the map, we display a list of tweets corresponding to the current visualization context. Using the twitter-text-js library, we identify hashtags, user handles and links embedded in tweets and autolink them to Twitter. This feature allows users to find out more about the tweets' authors and conversations on Twitter. Also, when users mouse over a tweet in the list, the map pops up a marker labeled with the author's username at the location of the tweet.

4.4 Timeline

The timeline display is a chart that illustrates how trending events progress over time in a clear and meaningful way. The y-axis represents the volume of tweets containing the search terms in the current visualization context; the x-axis represents time. A quick glance at the timeline should give users some idea of whether a topic is currently trending upward, trending downward, or is simply random noise. Trends recurring on a daily or weekly basis, such as "morning", "coffee" and "TGIF", also become apparent on the chart.

The timeline also serves as a navigation tool by filtering tweets by time. Here is how it works. As users zoom or pan the chart, the x-axis (time) scales or translates, and the chart redraws itself accordingly. The time range of the visualization context is directly tied to the chart. In other words, the tweets display as well as the Tweet Locations and Geo Trends layers on the map also reload to correspond to the new time range.

We recognize that tweet activity is an absolute measure and does not offer insights into the relative significance and timing of a trend.
For example, a topic currently trending at 50,000 tweets/hour does not tell users how the topic ranks against other topics. To address this need, the timeline display allows users to compare the baseline search terms against other terms. Figure 4-5 is based on tweets collected when Twitter was flooded with real-time updates and responses to the Boston Marathon bombing on April 15, 2013. While both keywords "boston" and "prayforboston" are trending, "boston" is relatively more popular. Also note how the latter slightly lags the former in becoming a trending event on Twitter.

Figure 4-5: Timeline display allows users to compare tweet activity of different events. On April 15, 2013, as two bombs exploded near the Boston Marathon finish line, "boston" and "prayforboston" became trending topics on Twitter.

Chapter 5

Evaluation

5.1 Experimental Setup

We examine the potential of our implementation for ad hoc trend detection on Twitter based on two criteria:

• Query time: How long it takes to identify a list of trending topics on Twitter within a given time range and geographical boundary. For the best user experience on the visualization interface, the MapD server should respond to a trend query in as little time as possible.

• Data size: The number of tweets our trend detection implementation can handle. As Twitter users easily generate millions of tweets each day, our system should handle as much data as possible without severely sacrificing performance.

We acquire millions of geo-tagged tweets from the Twitter Streaming API as our data source and load them into the MapD server. To evaluate overall system performance, we query trending topics over different numbers of tweets and measure the response time of the JSON requests. We further inspect individual components of our parallel implementation, namely histogram construction and partial sorting. Specifically, we compare their performance over different data sizes against their sequential counterparts.

We ran our experiments on a 64-bit Ubuntu Linux 12.04 machine equipped with two 6-core Intel Xeon E5-2620 processors at 2.00 GHz and 64 GB of DDR3 1333 MHz system memory. We use CUDA 5.0 as well as the Thrust library 1.5.3 that comes with it. Our GPU is an NVIDIA GTX 680, which has 4 GB of GDDR5 memory with a bandwidth of 192 GB/s, 8 Streaming Multiprocessors with 1536 CUDA cores, and CUDA compute capability 3.0.

5.2 Results

5.2.1 System Performance

Figure 5-1 shows how the response time of a trend query scales as we increase the query size, defined as the number of tweets the query has to read in order to derive trending keywords. We fixed the other query parameters at these values: 10 x-bins, 10 y-bins, and top 10 trends. Our data table contains 12.8 million tweets sent from the continental United States. We also compare how stop word filtering affects the performance of the system.

The response time for queries with and without stop word filtering remains relatively flat at about 100 ms when the query size is smaller than 800,000 tweets. As the query size increases beyond that, the response time rises rapidly, and the performance gain from filtering stop words becomes more apparent. In all cases, trend detection performs better with stop word filtering than without. We are able to run trend detection on up to 12.8 million tweets in less than 440 ms without the GPU running out of memory when constructing a histogram.
The memory issue is mainly due to the Thrust sorting function, which allocates auxiliary arrays to enable parallel operation and hence requires additional memory. To alleviate this limitation, one could explore a streaming algorithm for constructing histograms or simply use multiple GPU's.

Figure 5-1: Response time of trend queries for different query sizes, with and without stop word filtering (time in ms vs. query size in units of 100,000 tweets).

5.2.2 Histogram Construction

To evaluate the performance gain from a GPU implementation, we study how our histogram construction fares against a COUNT with GROUP BY SQL query. We load the same table into the MapD server and a PostgreSQL database. This table contains 64 million rows and five columns: row identifier, time, x-coordinate, y-coordinate and text. Each row in the text column contains only one word, sampled from the first word of actual tweets. To speed up the GROUP BY SQL, we create indexes on the x, y and time columns of the PostgreSQL table. We also set shared_buffers to 16 GB, work_mem to 4 GB, and maintenance_work_mem to 8 GB.

On both systems, each query groups the rows into 10 x-bins, 10 y-bins, and 2 time-bins. We disable stop word filtering in our system in order to make the two systems comparable. In addition, to study the performance overhead PostgreSQL suffers from calculating the bin identifiers sequentially, we compare its performance to a similar table that has precomputed bin identifiers. For the query plans PostgreSQL uses on these tables, see Appendix B. To omit outliers and cold start times, we perform 10 successive runs per query size and use the median execution time as our result.

As Figure 5-2 shows, both systems show query time rising rapidly as the query size increases. As expected, in PostgreSQL, the table with precomputed bin identifiers consistently performs better than the one without. At a sufficiently large query size, our implementation performs significantly better than a COUNT with GROUP BY query in PostgreSQL, and its magnitude of speedup increases with query size.

Figure 5-2: Histogram construction time on MapD vs. PostgreSQL for the unbinned and binned tables, and the resulting GPU speedup (query sizes in units of 100,000 rows).

5.2.3 Partial Sorting

To evaluate our implementation of partial sorting of multiple sub-arrays in parallel, we compare its performance against a sequential version on the CPU. Our data is a large array containing randomly generated 32-bit keys and another array of the same size containing 32-bit values. We divide the array pairs evenly into many pairs of sub-arrays, and sort each of these pairs by key to return the top K elements. Both the CPU and GPU versions use the same data set.

Figure 5-3 shows that when the number of sub-arrays is less than 32, the sorting time of our parallel implementation is relatively flat and is worse than the CPU version in most cases. This suggests that in these cases, the function does not have sufficient data to take advantage of parallelism. In addition, as we increase the value of the parameter K, the performance of our parallel implementation degrades, whereas the CPU version stays the same. On the other hand, when the number of sub-arrays is large enough and K is relatively small (< 20), as is often the case with our trend detection algorithm, the GPU version performs better than its CPU counterpart.

Figure 5-3: Performance of partial sorting on GPU vs. CPU, for sub-arrays of 1,000, 10,000 and 100,000 elements and K = 1, 8, 16 (time in ms vs. number of sub-arrays).

Chapter 6

Conclusion

6.1 Contributions

We presented TwitGeo, a system for visualizing trends on Twitter using a fast, GPU-based datastore. On the back end, we introduced a parallel approach to the problem of detecting trends on Twitter. This approach enables us to identify trending topics from a large data set on the fly instead of relying on precomputed statistics or predefined temporal and geospatial indexes. We evaluated how our implementation performs over different data sizes and compared its performance against a traditional database. Our parallel implementation of histogram construction resulted in as much as a 300× speedup in comparison to PostgreSQL, and our system detected trending keywords from 12.8 million tweets in less than 440 ms.

On the front end, we developed an interface that highlights interesting patterns on Twitter in both space and time. Our map visualization summarizes what topics are trending on Twitter over different geographical regions. Our design features elements that are interactive and encourage users to explore conversations on Twitter.

6.2 Future Work

While our trend detection algorithm shows promising results, more work is needed before we can deploy this system for real-time trend detection on Twitter. Specifically, when handling a large data set that does not fit into GPU memory, the datastore needs a mechanism to load its data and perform computation in a streaming fashion.

Another area worth pursuing is exploring more sophisticated trend detection models and implementing data parallel variants of them. A simple keyword-based approach has served its purpose of sub-second query latency well, but topic models such as LDA may offer a more rigorous content analysis on Twitter.

Finally, due to the popularity of smartphones, Twitter users are increasingly sharing photos and videos in their tweets. Adding the functionality of local photo and video search by user or keyword to our visualization interface may provide another interesting option to explore events on Twitter.

Appendix A

Trend Detection Statistics

In this section, we outline the derivation of the sample means and sample variances used in Welch's t-test statistic. For a given word w, let sample X_i denote the set obtained by testing whether each word from tweets in time period i is equal to word w. For the same time period, let N_i be the total number of words in tweets and C_i be the number of occurrences of word w.
X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,N_i}\}, \qquad C_i = \sum_j x_{i,j}

where

x_{i,j} = \begin{cases} 1 & \text{if } word_j = w \\ 0 & \text{if } word_j \neq w \end{cases}

It follows that the sample mean \bar{X}_i and sample variance s_i^2 are given by:

\bar{X}_i = \frac{C_i}{N_i}

s_i^2 = \frac{\sum_j (x_{i,j} - \bar{X}_i)^2}{N_i - 1} = \frac{N_i \left( \sum_j x_{i,j}^2 \right) - C_i^2}{N_i (N_i - 1)}

Appendix B

Query Plans

Query plan for the table without precomputed bin identifiers:

EXPLAIN SELECT SUM(c) FROM
  (SELECT binX, binY, binTime, COUNT(*) c FROM
    (SELECT FLOOR((goog_x - -13888497.96) / (-7450902.94 - -13888497.96) * 10) AS binX,
            FLOOR((goog_y - 2817023.96) / (6340356.62 - 2817023.96) * 10) AS binY,
            FLOOR(CAST((time - 1365998400) AS FLOAT) / (1366084800 - 1365998400) * 2) AS binTime
     FROM tab4 LIMIT 64000000) tabi
   GROUP BY binX, binY, binTime) tabo;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=4710289.01..4710289.02 rows=1 width=8)
   ->  HashAggregate  (cost=4566289.01..4630289.01 rows=6400000 width=24)
         ->  Limit  (cost=0.00..3286289.01 rows=64000000 width=20)
               ->  Seq Scan on tab4  (cost=0.00..5352321.40 rows=104235680 width=20)

Query plan for the table with precomputed bin identifiers:

EXPLAIN SELECT SUM(c) FROM
  (SELECT binX, binY, binTime, COUNT(*) c FROM
    (SELECT * FROM tab5 LIMIT 64000000) tabi
   GROUP BY binX, binY, binTime) tabo;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=2567374.50..2567374.51 rows=1 width=8)
   ->  HashAggregate  (cost=2423374.50..2487374.50 rows=6400000 width=24)
         ->  Limit  (cost=0.00..1143374.50 rows=64000000 width=32)
               ->  Seq Scan on tab5  (cost=0.00..1862177.60 rows=104234760 width=32)

Bibliography

[1] P. Bakkum and K. Skadron. Accelerating sql database operations on a gpu with cuda. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 94–103. ACM, 2010.

[2] J. Benhardus and J. Kalita. Streaming trend detection in twitter. International Journal of Web Based Communities, 9(1):122–139, 2013.

[3] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: interactive topic-based browsing of social status streams. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pages 303–312. ACM, 2010.

[4] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. Gputerasort: high performance graphics co-processor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 325–336. ACM, 2006.

[5] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 215–226. ACM, 2004.

[6] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS), 34(4):21, 2009.

[7] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9), 2013.

[8] T. Kraft, D. X. Wang, J. Delawder, W. Dou, L. Yu, and W. Ribarsky. Less after-the-fact: Investigative visual analysis of events from streaming twitter.

[9] A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, and R. C. Miller. Twitinfo: aggregating and visualizing microblogs for event exploration. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 227–236. ACM, 2011.

[10] M. Mathioudakis and N.
Koudas. Twittermonitor: trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 1155–1158. ACM, 2010. [11] T. Mostak. TweetMap. http://worldmap.harvard.edu/tweetmap/, 2013. [12] I. B. Murphy. Fast Database Emerges from MIT Class, GPUs and Students Invention. http://data-informed.com/ fast-database-emerges-from-mit-class-gpus-and-students-invention/, 2013. [Online; accessed 24-Aug-2013]. [13] NVIDIA. NVIDIA Unveils New Flagship GPU For Visual Computing. http://nvidianews.nvidia.com/Releases/ NVIDIA-Unveils-New-Flagship-GPU-for-Visual-Computing-9e3.aspx, 2013. [Online; accessed 24-Aug-2013]. [14] D. Ramage, S. T. Dumais, and D. J. Liebling. Characterizing microblogs with topic models. In ICWSM, 2010. [15] X. Wang, W. Dou, Z. Ma, J. Villalobos, Y. Chen, T. Kraft, and W. Ribarsky. Isi: Scalable architecture for analyzing latent topical-level information from social media data. In Computer Graphics Forum, volume 31, pages 1275–1284. Wiley Online Library, 2012. [16] J. Zhang, S. You, and L. Gruenwald. High-performance online spatial and temporal aggregations on multi-core cpus and many-core gpus. In Proceedings of the fifteenth international workshop on Data warehousing and OLAP, pages 89–96. ACM, 2012. 52