Forecasting Twitter Topic Popularity Using Bass Diffusion Model and Machine Learning

by

Yingzhen Shen

Bachelor of Science in Information Science and Technology
Tsinghua University, Beijing, China, 2013

SUBMITTED TO THE DEPARTMENT OF CIVIL AND ENVIRONMENTAL ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN TRANSPORTATION AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY

MAY 2015

© 2015 Yingzhen Shen. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: (Signature redacted)
Department of Civil and Environmental Engineering
May 18, 2015

Certified by: (Signature redacted)
David Simchi-Levi
Professor of Civil and Environmental Engineering
Thesis Supervisor

Accepted by: (Signature redacted)
Heidi Nepf
Donald and Martha Harleman Professor of Civil and Environmental Engineering
Chair, Graduate Program Committee
Forecasting Twitter Topic Popularity Using Bass Diffusion Model and Machine Learning

by

Yingzhen Shen

Submitted to the Department of Civil and Environmental Engineering in Partial Fulfillment of the Requirements for the Degree of Master of Science in Transportation

ABSTRACT

Today, social network websites like Twitter are important information sources for a company's marketing, logistics, and supply chain. Sometimes a topic about a product will "explode" at a "peak day," suddenly being talked about by a large number of users. Predicting the diffusion process of a Twitter topic helps a company forecast demand and plan ahead to dispatch its products. In this study, we collected Twitter data on 220 topics covering a wide range of fields, and created 12 features for each topic at each time stage, e.g. the number of tweets mentioning the topic per hour, the number of followers of users already mentioning the topic, and the percentage of root tweets among all tweets. The task in this study is to predict the total mention count within the whole time horizon, 180 days, as early and accurately as possible. To complete this task, we applied two approaches: fitting the curve denoting topic popularity (the mention count curve) with the Bass diffusion model, and using machine learning models, including K-nearest-neighbor, linear regression, bagged tree, and an ensemble, to learn topic popularity as a function of the features we created. The results of this study reveal that the Basic Bass model captures the underlying mechanism of the Twitter topic development process, and that we can draw an analogy between the adoption of a Twitter topic and the diffusion of a new product. Using only mention count over the whole time horizon, the Bass model has much better predictive accuracy than machine learning models with extra features.
However, even with the best model (the Bass model) and focusing on the subset of topics with better predictability, predictive accuracy is still not good enough before the "explosion day." This is because an "explosion" is usually triggered by news outside Twitter, and is therefore hard to predict without information from outside Twitter.

Thesis Supervisor: David Simchi-Levi
Title: Professor, Department of Civil and Environmental Engineering

BIOGRAPHICAL NOTES

The author graduated in 2013 from Tsinghua University with a Bachelor's degree in Information Science and Technology. She has research experience in transportation choice behavior and the airline industry, and industry experience in financial data mining. She is now a research assistant of Prof. David Simchi-Levi in the MIT Department of Civil and Environmental Engineering. Her research interests include social network big data, machine learning, and optimization.

ACKNOWLEDGEMENTS

I wish to take this opportunity to thank all the professors, lab mates, classmates, and friends who helped me during my two years in the Department of Civil and Environmental Engineering at MIT. I am fortunate to have studied and done research at MIT with a group of excellent and friendly people.

Deep appreciation goes to my advisor, Prof. David Simchi-Levi. His understanding, trust, encouragement, and expertise have always motivated me to learn more and dig deeper in my research. I believe that the experience of working in his group will influence my future positively and profoundly.

I would also like to express my gratitude to my co-worker, Xiaoyu Zhang, who suggested applying the Bass model and provided enormous help throughout the research process.

CONTENTS

ABSTRACT .......................................................... 3
BIOGRAPHICAL NOTES ................................................ 5
ACKNOWLEDGEMENTS .................................................. 7
CONTENTS .......................................................... 9
List of Tables .................................................... 12
List of Figures ................................................... 13
1. Introduction ................................................... 15
2. Literature Review .............................................. 17
2.1. Online word of mouth and marketing ........................... 17
2.2. Predicting popularity of individual tweet .................... 18
2.3. Bass diffusion models and extensions ......................... 23
2.3.1. Diffusion/Adoption process ................................. 23
2.3.2. Basic Bass Diffusion Model ................................. 24
2.3.3. Use the Bass model to fit or predict ....................... 26
2.3.4. Bass curve shape with different p and q .................... 27
2.3.5. The Generalized Bass model ................................. 28
2.4. Integrating social networks into economic diffusion models ... 29
2.4.1. Two types of Twitter topics with different predictability .. 29
2.4.2. Predict Twitter hashtag popularity by classification ....... 29
2.4.3. Apply diffusion models to predict the user count of social network websites ... 30
2.4.4. Predict hashtag/keyword popularity when popularity is defined continuously ... 32
3. Data Overview .................................................. 35
3.1. Definitions .................................................. 35
3.2. Data source - Topsy API ...................................... 35
3.3. Data size - number of topics and time horizon of each topic .. 38
3.4. Raw data and its format ...................................... 39
3.5. Data processing before modeling .............................. 41
3.6. Other data sources providing Twitter data .................... 42
4. Models ......................................................... 43
4.1. Dependent and independent variables in machine learning models ... 43
4.2. Metrics used to evaluate models .............................. 44
4.3. Machine learning regression models ........................... 45
4.3.1. K-nearest-neighbor with feature selection .................. 45
4.3.2. Linear regression with feature selection ................... 46
4.3.3. Bagged regression tree ..................................... 46
4.3.4. Ensemble model ............................................. 46
4.4. The Basic Bass model ......................................... 47
4.4.1. Model formulation .......................................... 47
4.4.2. Three time aggregation levels - fitting error and forecasting error ... 48
4.4.3. Define the predicted cumulative mention count .............. 49
4.5. Modeling on a subset of topics ............................... 49
4.5.1. Determine the number of clusters ........................... 50
4.5.2. Clustering results ......................................... 58
4.5.3. Filtering topics by the total mention count before peak day ... 59
4.6. Notations .................................................... 59
5. Results - Predicting Cumulative Mention Count .................. 61
5.1. Machine learning model results ............................... 61
5.1.1. K-nearest-neighbor with feature selection .................. 61
5.1.2. Linear regression with feature selection ................... 63
5.1.3. Bagged tree ................................................ 64
5.1.4. Ensemble model ............................................. 65
5.2. The Basic Bass model results ................................. 67
5.2.1. m or N180 - which one is better in prediction? ............. 67
5.2.2. Three time aggregation levels .............................. 69
5.3. Machine learning vs. the Bass model .......................... 71
5.4. Modeling on a subset of topics ............................... 73
5.4.1. Machine learning on Cluster 1 only ......................... 74
5.4.2. The Bass model on a subset of topics ....................... 75
6. Discussion on the Parameter Estimation Methods of the Bass Model ... 79
6.1. Review of existing methods to estimate the Bass model parameters ... 79
6.1.1. Evaluate estimation methods by fitting error and one-step-ahead predicting error ... 79
6.1.2. Discrete Bass model ........................................ 83
6.2. Parameter estimation procedure in this thesis ................ 84
6.3. Evaluate OLS and NLS with Twitter data ....................... 84
7. Conclusions and Summary ........................................ 87
8. Future Work .................................................... 89
BIBLIOGRAPHY ...................................................... 91
Appendix A The List of Topics Used in this Research ............... 95
Appendix B Predicted Bass Model Curves of 8 Sampled Topics with Only Partial Data Known to the Model ... 103

List of Tables

Table 2.1 Log-likelihood (LL) and Deviance information criterion (DIC) for a 100% observation fraction for the full Retweet Model and a nested Strawman Model - Zaman et al. (2014) ... 22
Table 2.2 Features used in Ma et al. (2013) - 7 content features (Fc) and 11 contextual features (Fx) ... 30
Table 2.3 Five Data Sources for Wang (2011) ... 32
Table 3.1 Topsy API ... 37
Table 3.2 Filter parameters of Topsy API ... 38
Table 3.3 Raw Data Part 1 (collected by hour) - Variables for each topic ... 40
Table 3.4 Raw Data Part 1 - Topic 89 (iWatch) as an Example ... 40
Table 3.5 Raw Data Part 2 - Sheet A: Count of influencers mentioning the topic ... 41
Table 3.6 Raw Data Part 2 - Sheet B: Average follower count of influencers mentioning a certain topic ... 41
Table 3.7 Raw Data Part 2 - Sheet C: Sum of the influential levels of all influencers ... 41
Table 4.1 Independent variables (features) ... 44
Table 4.2 Variations of KNN ... 45
Table 4.3 Variations of Linear Regression models ... 46
Table 4.4 Variations of Tree models ... 46
Table 4.5 Percentage mention count every 5 days (vector being clustered) ... 51
Table 4.6 Principal components of mention count vector ... 52
Table 4.7 Determine number of clusters ... 58
Table 4.8 Number of topics in each cluster (k-means) ... 58
Table 5.1 The 8 machine learning models ... 61
Table 5.2 Best model at each time stage ... 65
Table 5.3 Fitting accuracy (relative error with 100% data) with m or N180 ... 67
Table 5.4 Compare topics in two clusters ... 74
Table 5.5 Relative error of the Bass model on a subset of topics ... 76

List of Figures

Figure 1.1 Twitter mention count of iWatch ... 15
Figure 2.1 Spread of One Sample Root Tweet - from Zaman et al. (2014) ... 19
Figure 2.2 Distribution of Reaction Time for 1 Root Tweet - Zaman et al. (2014) ... 20
Figure 2.3 Prediction of the total number of retweets for two root tweets - Zaman et al. (2014) ... 20
Figure 2.4 Median absolute percentage error (MAPE) of Zaman's Retweet Model and Benchmarks ... 21
Figure 2.5 MAPE of Retweet Model vs Benchmarks with no knowledge of the network structure - Zaman et al. (2014) ... 22
Figure 2.6 Innovators and Imitators ... 23
Figure 2.7 Number of Innovators and Imitators over time ... 24
Figure 2.8 The shape of N(t) curve and n(t) curve ... 25
Figure 2.9 Actual sales and predicted sales for room air conditioners - Bass (1969) ... 26
Figure 2.10 Predict Color TV sales based on the first three points - Bass (1969) ... 27
Figure 2.11 Comparison of different sigmoid functions ... 27
Figure 2.12 Shape of n(t) curve - q>p on the left, q<p on the right ... 27
Figure 2.13 Shape of n(t) curve with different values of p (innovation) and q (contagion) ... 28
Figure 2.14 Precision and Recall (source: Wikipedia) ... 30
Figure 2.15 Twitter Cumulative Adoptions: 5 Years since Introduction - True (left) vs Fitted (right) - Wang (2011) ... 31
Figure 2.16 Predict future Google search frequency for "YouTube" (left) and "Twitter" (right) - Bauckhage et al. (2014) ... 31
Figure 2.17 The Linear Influence Model ... 33
Figure 2.18 Influence function across different fields - (a) Politics, (b) Nation, (c) Entertainment, (d) Business, (e) Technology, (f) Sports ... 34
Figure 3.1 Topsy Homepage ... 36
Figure 3.2 An Example of Using Topsy - Tweets on 'iWatch' per day ... 36
Figure 3.3 Tweet Count on "iWatch" vs. "iPhone 6S" ... 36
Figure 3.4 Time Horizon and Mention Count of Topic 89 - iWatch ... 39
Figure 3.5 Keyhole's Results for "iWatch" ... 42
Figure 4.1 Distribution of the total mention count (220 topics) ... 43
Figure 4.2 True vs Fitted cumulative mention count curve ... 47
Figure 4.3 Compare fitted curves with aggregation levels of 1 hour, 1 day, and 5 days ... 48
Figure 4.4 Non-cumulative mention count of topics with/without fluctuation before peak day ... 50
Figure 4.5 Plot the First 2 Principal Components ... 52
Figure 4.6 Plot the First 3 Principal Components ... 53
Figure 4.7 Illustration of Elbow method ... 53
Figure 4.8 Elbow method when the "elbow" cannot be identified ... 53
Figure 4.9 Elbow method cost as a function of K, with our Twitter data ... 54
Figure 4.10 Information criteria with the first 6 principal components of mention count vector ... 55
Figure 4.11 Average Silhouette with our Twitter data ... 56
Figure 4.12 An example of Gap criterion ... 57
Figure 4.13 Gap values with our Twitter Data (Input is reduced to 16-dim; Clustered by Gaussian Mixture Model) ... 57
Figure 4.14 Percentage mention count every 5 days of topics in each cluster (K=2) ... 59
Figure 5.1 Relative error of KNN variations ... 62
Figure 5.2 Details after peak day - relative error of KNN variations ... 62
Figure 5.3 Relative error of linear regression models ... 63
Figure 5.4 Details after peak day - relative error of linear regression models ... 63
Figure 5.5 Relative error of bagged tree models ... 64
Figure 5.6 Details after peak day - relative error of bagged tree models ... 64
Figure 5.7 Relative error of Ensemble model ... 65
Figure 5.8 Details after peak day - relative error of ensemble model ... 66
Figure 5.9 Compare relative error of four types of models ... 66
Figure 5.10 The Bass model predictive accuracy with m or N180 ... 68
Figure 5.11 Cumulative mention count of Topic 8 - #YOLO ... 68
Figure 5.12 Predictive accuracy with 3 aggregation levels ... 69
Figure 5.13 Details - predictive accuracy with 3 aggregation levels ... 69
Figure 5.14 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 89 - iWatch) ... 70
Figure 5.15 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 203 - #ISS) ... 71
Figure 5.16 Machine learning vs. Basic Bass ... 72
Figure 5.17 Details - Machine learning vs. Basic Bass ... 73
Figure 5.18 Predictive accuracy of the Bass model (aggregation level is 1 hr) ... 73
Figure 5.19 Different sigmoid functions look similar at early stage ... 73
Figure 5.20 Percentage mention count every 5 days of topics in each cluster (K=2) ... 74
Figure 5.21 Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only ... 75
Figure 5.22 Details after the peak day - Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only ... 75
Figure 5.23 How much relative error of the Bass model can be reduced by using a subset of topics (Data are aggregated to 1 hr) ... 77
Figure 6.1 Predicting error - OLS (dashed line) vs NLS (solid line) ... 85

1. Introduction

On social network websites such as Twitter, some topics will "explode" at a "peak day," suddenly being talked about by a large number of users. Some users hear about a topic from TV, newspapers, or news websites; others learn about it from their Twitter feeds, including tweets posted by their Twitter friends. If the topic is about a product, then these tweets offer insights into the logistics and supply chain of that product. For example, if we can predict whether or when the "explosion" will happen, the company can plan ahead to increase production. Or, if the comments on the topic are negative, as with "Toyota brake," predicting the "explosion" can help the company detect the problem and recall the products earlier. Figure 1.1 shows the number of tweets posted about the topic "iWatch," both cumulative and non-cumulative. The peak day is September 9, 2014, the day Apple's CEO announced iWatch.
[Figure 1.1 Twitter mention count of iWatch: cumulative and non-cumulative mention counts over time, with the peak day (2014/9/9) marked]

In this study, we use "diffusion" to denote the process by which a topic becomes known to more and more users on Twitter, and "adoption" to denote the process by which users tweet, retweet, or reply to a topic. We collected Twitter data on 220 topics covering a large variety of fields (Appendix A). For each topic, we collected its mention count on Twitter from 120 days before the peak day to 60 days after the peak day, and created 12 features for each topic at each time stage, e.g. the number of followers of users already mentioning the topic, and the percentage of root tweets among all tweets.

The task in this study is to predict the total mention count for every topic within the 180 days, as early and accurately as possible. To complete this task, we apply the Bass diffusion model to fit the mention count curve, and machine learning models, including K-nearest-neighbor, linear regression, bagged tree, and an ensemble, to learn the total mention count as a function of the features we created. To our knowledge, this study is the first to apply the Bass model to predict Twitter topic popularity, where a topic is specified by keywords (e.g. "Jeremy Lin") or a hashtag (e.g. "#JeremyLin").

The results of this study reveal that the Basic Bass model captures the underlying mechanism of the Twitter topic development process, and that we can draw an analogy between the adoption of a Twitter topic and the diffusion of a new product. Using only mention count over the whole time horizon, the Bass model has much better predictive accuracy than machine learning models with extra features (e.g. % root tweets). However, even with the best model, the Basic Bass model, and focusing on the subset of topics with better predictability, predictive accuracy is still not good enough before the peak day (Day 120).
This is because the peak is usually triggered by news outside Twitter, and is therefore very hard to predict in advance without information from outside Twitter.

The rest of this thesis is organized as follows. In Section 2, we review previous work related to the diffusion process and online word of mouth (WoM), especially on Twitter. Section 3 gives an overview of the Twitter data we collected and processed. In Section 4, we describe how we model the Twitter data with machine learning models and the Bass model. Model results are discussed in Section 5. In Section 6, we discuss the parameter estimation procedures for the Bass model. Finally, Section 7 draws conclusions, and Section 8 points to future research directions.

2. Literature Review

In this section, we review previous research on predicting the impact and popularity of online word of mouth, especially research using economic diffusion models.

2.1. Online word of mouth and marketing

Individuals learn by observing the behavior of others and by taking advice from others; this is called social learning. For example, people buy products based on what their friends have bought and on their friends' suggestions. Because of the wide use of the Internet today, connections between people exist not only in the real world but also online. Therefore, an important marketing channel today is online WoM, e.g. Amazon product reviews, blogs, and social network websites.
A large body of literature investigates:
- How information (text, photo, or video) diffuses and spreads among a population
- What kind of information will go viral
- What role online WoM plays in marketing
- How to modify marketing strategy to utilize the business value of online WoM

To understand the mechanism behind the social learning process, Çelen and Kariv (2004) modeled the information diffusion process as a Bayes-rational sequential decision-making process, where each decision maker observes only his/her predecessor's binary action. Later, Çelen and Kariv (2005) and Çelen et al. (2010) designed experiments to test this theory, under a perfect-information condition, in which individuals can observe all the decisions that have previously been made, and an imperfect-information condition.

Another line of research on the mechanism of online information diffusion concerns videos, predicting whether or not a video will go viral based on features of the video and the network structure. Wallsten (2010) studied the factors that lead viral videos to spread across the Internet, assessing the relationships that drive viral videos by examining the interplay between audience size, blog discussion, campaign statements, and mainstream media coverage of the video. Guadagno et al. (2013) examined the role of emotional response and video source in the likelihood of spreading an Internet video. Their results indicated that individuals reporting strong affective responses to a video reported greater intent to spread it, and that anger-producing videos were more likely to be forwarded.

Archak et al. (2011) connected online WoM to marketing by incorporating Amazon product review text into a consumer choice model, decomposing textual reviews into segments describing different product features.
They demonstrated how textual data can be used to learn consumers' relative preferences for different product features, and also how text can be used for predictive modeling of future changes in sales. 17 Netzer et al. (2012) found it possible to monitor market structure through text mining on online user generated content (UGC), therefore converting the UGC to market structures and competitive landscape insights. Another way to get insights from online WoM was to derive brand sentiment from format of mini-blogs posted online (e.g. length of product review), proposed by Schweidel and Moe (2014). The online WoM is so impactful and insightful, that Tirunillai and Tellis (2012) discovered even the stock market performance was significantly related to online UGC. And interestingly, the effect of negative and positive UGC on abnormal stock returns is asymmetric. Whereas negative UGC has a significant negative effect on abnormal returns with a short "'wear-in" and long "wear-out," positive UGC has no significant effect on these metrics. The wide impact and business power of online WoM motivated researchers to examine when and how the seller should adjust its own marketing communication strategy in response to consumer reviews. Chen and Xie (2008) discovered that offering consumer review information too early leads to a lower profit. In addition, the optimal online marketing strategy should depend on product characteristics, the informativeness of the review, the seller's product assortment strategy, the seller's product value for the partially matched consumers, and consumer heterogeneity in product consumption expertise. Starting from the next section, we will focus on text on social network websites, e.g. Twitter or Facebook, and leave videos, photos, and online retailer product reviews aside. 2.2. 
Predicting the popularity of an individual tweet

In this section, we focus on a fundamental and theoretical model to predict the spread of an individual tweet, proposed by Zaman et al. (2014). When a root tweet, which is an original tweet created and posted by a Twitter user, is posted on Twitter, followers of this root user will see it on their personal Twitter pages and might retweet it. In this way, one root tweet spreads through the Twitter network, and might be known to a large number of users in the end. Zaman et al. (2014) predicted the number of retweets as time goes to infinity, which is an evaluation of root tweet popularity. Given a root tweet and the network structure, which is the "follow" relationship among users, Zaman et al. (2014) built a Bayesian model to forecast the total number of retweets, which can be several hops from the root tweet. The data they used were retweets from 52 root tweets (26 training and 26 testing), and the data set is available on Zaman's homepage: http://www.zlisto.com/. Figure 2.1 shows the data for the root tweet "Cory Booker has never worked a day in his life. Not. #corybookerstories" by root user "pbsgwen." The plot on the upper left shows the number of retweets of the root tweet versus time. The table on the upper right shows several users retweeting this tweet. The images of the retweet graph at different times are shown below, from which we can see the spreading process of this root tweet in the Twitter network.

Figure 2.1 Spread of One Sample Root Tweet - from Zaman et al. (2014)

Zaman made two assumptions about the spreading process of one root tweet: 1) Reaction time, the time interval between a retweet and its parent tweet, follows a log-normal distribution.
The distribution parameters are root-tweet specific, differing across root tweets. Figure 2.2 is a plot of the reaction time distribution, based on the data for one root tweet posted by user "KimKardashian." Log-normally distributed reaction times have been observed in other application areas, e.g. the time for people to respond to emails (Stouffer, Malmgren and Amaral, 2006) and call durations in call centers (Brown et al., 2005); psychological and more fundamental explanations are given in Ulrich and Miller (1993) and Van Breukelen (1995). 2) The decision to retweet or not follows a binomial distribution, and the distribution parameters are root-tweet- and user-specific.

Figure 2.2 Distribution of Reaction Time for 1 Root Tweet - Zaman et al. (2014)

Figure 2.3 shows part of the results, comparing the true and predicted values of the total number of retweets for two root tweets ("TheRock," 1260 retweets, and "KimKardashian," 768 retweets). The solid line is the number of observed retweets versus time. The error bars correspond to the 90% credible intervals of the predictive distribution for the total number of retweets, based on observations only up to that time point. The point in the middle of each error bar is the posterior median of the predictive distribution for the total number of retweets.

Figure 2.3 Prediction of the total number of retweets for two root tweets - Zaman et al. (2014)

Zaman then compared his Retweet Model to three benchmarks (Figure 2.4). The metric for evaluating models is the median absolute percentage error (MAPE), the median of the absolute percentage prediction error among the 26 testing samples. MAPE drops with a higher observation fraction, which makes sense.
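The MAPE metric as defined here is simple to compute. Below is a minimal sketch; the predicted and observed retweet counts are illustrative values, not data from Zaman et al. (2014).

```python
import numpy as np

def mape(y_true, y_pred):
    """Median absolute percentage error: the median of
    |prediction - truth| / truth across the test samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.median(np.abs(y_pred - y_true) / y_true)

# e.g. predicted vs. observed final retweet counts for a few test tweets
error = mape([1260, 768, 500], [1000, 800, 450])
```

Unlike the mean, the median makes this metric robust to a few badly mispredicted tweets, which matters given the heavy-tailed nature of retweet counts.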
The three benchmarks are listed below:

1) Linear regression model

log(M^x) = β0 + β1 log(f0^x) + ε^x

where M^x is the final total number of retweets of root tweet x, f0^x is the follower count of the root user, and ε^x is a zero-mean, normally distributed error term. This model uses no temporal information, only the follower count of the root user, and has an MAPE of 65%.

2) Regression model (Szabo and Huberman, 2010)

log(M^x) = β(t) + log(m^x(t)) + ε^x

which uses only the current retweet count m^x(t) of root tweet x.

3) Dynamic Poisson model with an exponentially decaying rate (Agarwal, Chen and Elango, 2009). This model bins time into 5-minute intervals indexed by k, and the number of retweets in the kth bin is a Poisson random variable with rate λ_k.

Figure 2.4 shows that Zaman's Retweet Model outperforms the benchmarks at every observation fraction. The intuitive reason is that it uses both temporal information and network structure information.

Figure 2.4 Median absolute percentage error (MAPE) of Zaman's Retweet Model and Benchmarks

To demonstrate the impact of the network structure information, Zaman compared his Retweet Model to two other benchmark models with no knowledge of the network structure. One is a naive model always predicting 1.4 m^x(t^x) as the total retweet count. The other is a Strawman Model that ignores the follower counts f_j^x and retweet probabilities b_j, and assumes that M^x comes from a Poisson distribution (not binomial as before, since f_j^x is unknown) with a global rate λ. In contrast, Zaman's Retweet Model assumes that each of the f_1^x followers of the root user will independently retweet with probability b_1, so the number of one-hop retweets M_1^x follows a binomial distribution Bin(f_1^x, b_1). Results comparing the models are shown in Figure 2.5 and Table 2.1.
Figure 2.5 MAPE of the Retweet Model vs benchmarks with no knowledge of the network structure - Zaman et al. (2014)

Table 2.1 Log-likelihood (LL) and deviance information criterion (DIC) at a 100% observation fraction for the full Retweet Model and a nested Strawman Model - Zaman et al. (2014)

Model          | LL       | DIC
Retweet model  | -38,860  | 83,848
Strawman model | -103,907 | 208,026

Zaman's model works for predicting the popularity of a single tweet, which is a finer granularity than our task in this thesis, predicting the popularity of a topic. Can we generalize Zaman's tweet-level model to a topic-level model? One natural idea is that, if we already know how one root tweet spreads, our task boils down to predicting the spread of all root tweets on a certain topic. However, the difficulty is that there are a large number of root tweets posted by root users, and there is large overlap among the followers of these root users. More complicated still, when a follower sees a root tweet, he/she can choose to retweet it or to post a root tweet him/herself. So we cannot simply sum the retweet counts over all root tweets talking about a topic; otherwise there will be double counting, or even triple counting. In the next section, we will look at a model useful at a higher aggregation level, treating the existing adopters (root tweets in this Twitter case) as a whole, not individually.

2.3. Bass diffusion models and extensions

2.3.1. Diffusion/Adoption process

Rogers (1962) developed the first diffusion model, and defined "diffusion of innovation" as: "The process by which an innovation is communicated through certain channels over time among the members of a social system." The adoption process of an innovation is the steps an individual goes through from the time he hears about an innovation until final adoption (the decision to use an innovation regularly).
In the Twitter case, the adoption behavior is to tweet, retweet, or mention a topic after seeing it. An innovation could be any "idea, practice, or object that is perceived as new by an individual or other unit of adoption" (Rogers, 2003). Rogers (2003) also provided four key elements of the diffusion process - the innovation, the social system which the innovation affects, the communication channels of that social system, and time. Under this definition, the spread of topics on Twitter can be treated as diffusion.

Figure 2.6 Innovators and Imitators - Source: Slides from Bar-Ilan University (https://facultv.biu.ac.il/~fruchta/829/lec/6.pd)

Figure 2.7 Number of Innovators and Imitators over time

Adopters can be classified into innovators and imitators (Figure 2.6 and Figure 2.7). Innovators obtain information from outside the social network, e.g. from mass-media communication. Imitators, unlike innovators, are influenced in the timing of adoption by the decisions of other members of the social system.

2.3.2. Basic Bass Diffusion Model

The Basic Bass Diffusion Model, or the Bass model, was first developed by Bass (1969), and is the most widely used diffusion model. It consists of a simple differential equation that describes the process by which new products get adopted in a population. In words, the Bass model states:

number of customers who will purchase the product at time t
= p × (remaining potential)            [innovation effect]
+ q × (adopters) × (remaining potential)   [imitation effect]

where
- p is the coefficient of innovation, or the coefficient of external influence, denoting the effect of mass media on potential triers (generating innovators).
- q is the coefficient of imitation, or the coefficient of internal influence, denoting the effect of WoM on potential triers (generating imitators).
More exactly, if we use the following notation:
- N(t) = the cumulative number of adopters of the product up to time t
- m = the total number of potential buyers of the new product = N(∞)
- n(t) = the number of customers who will purchase the product at time t

Then we have the most important equation in Bass (1969):

n(t) = dN(t)/dt = p[m - N(t)] + (q/m) N(t)[m - N(t)]

Solving the above differential equation with boundary condition N(0) = 0, we get the expression for the cumulative number of adopters:

N(t) = m (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t))

Other quantities, including the non-cumulative number of adopters n(t), the time of peak adoption t*, and the number of adopters at the peak time n(t*), can be derived as follows:

n(t) = dN(t)/dt = m p (p+q)² e^(-(p+q)t) / [p + q e^(-(p+q)t)]²

t* = ln(q/p) / (p+q)

n(t*) = m (p+q)² / (4q)

The shapes of the N(t) and n(t) curves are shown below in Figure 2.8.

Figure 2.8 The shape of the N(t) curve and the n(t) curve

Note that the Bass model holds under several assumptions:
- The diffusion process is binary (a consumer either adopts, or waits to adopt).
- The maximum potential number of buyers (m) is a constant.
- Eventually, all m will buy the product.
- There is no repeat purchase or replacement purchase.
- The impact of WoM is independent of adoption time.
- The innovation is considered independent of substitutes.
- The marketing strategies supporting the innovation are not explicitly included.

2.3.3. Using the Bass model to fit or predict

Bass (1969) used his model primarily for fitting the n(t) - N(t) function and estimating the values of m, p, and q. Using the fitted m, p, and q, the n(t) - t curve can be plotted, as shown in Figure 2.9 (n(t) in this case means sales from t - 1 to t). Details about the parameter estimation process will be discussed later in this thesis.
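The closed-form expressions above translate directly into code. Below is a minimal sketch; the parameter values are illustrative, not fitted to any topic in this thesis.

```python
import numpy as np

def bass_N(t, m, p, q):
    """Cumulative adopters: N(t) = m(1 - e^{-(p+q)t}) / (1 + (q/p)e^{-(p+q)t})."""
    e = np.exp(-(p + q) * t)
    return m * (1 - e) / (1 + (q / p) * e)

def bass_n(t, m, p, q):
    """Non-cumulative adoption rate: n(t) = m p (p+q)^2 e^{-(p+q)t} / (p + q e^{-(p+q)t})^2."""
    e = np.exp(-(p + q) * t)
    return m * p * (p + q) ** 2 * e / (p + q * e) ** 2

def bass_peak(m, p, q):
    """Peak time t* = ln(q/p)/(p+q) and peak rate n(t*) = m(p+q)^2/(4q); valid when q > p."""
    t_star = np.log(q / p) / (p + q)
    n_star = m * (p + q) ** 2 / (4 * q)
    return t_star, n_star

# Illustrative parameters: q > p, so the adoption curve has an interior peak
m, p, q = 1e6, 0.01, 0.2
t_star, n_star = bass_peak(m, p, q)
```

As a sanity check, evaluating `bass_n` at `t_star` reproduces `n_star`, and `bass_N` tends to `m` for large t, consistent with the assumption that all m potential buyers eventually adopt.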
Figure 2.9 Actual sales and predicted sales for room air conditioners - Bass (1969)

Since the Bass model has three parameters, m, p, and q, as few as three pairs of (n(t), N(t)) can determine a curve to predict future sales. Figure 2.10 shows the forecast annual sales curve fitted with the first three points, provided in Bass (1969). However, parameter estimation from only the first three points is not robust. Note that N(t) has the form of a sigmoid function, S(t) = 1/(1 + e^(-t)). If a sigmoid function is shifted left or right, or scaled, its left side (where the first few points are located) does not change much. As shown in Figure 2.11, there is little difference among the three S-curves on the left-hand side. So intuitively, the predictive power of the Bass model is limited, especially when we fit the curve using only points before the jump of the S-curve.

Figure 2.10 Predicting color TV sales based on the first three points - Bass (1969)

Figure 2.11 Comparison of different sigmoid functions: 1/(1+exp(-2x+10)), 2/(1+exp(-2x+10)), and 1.5/(1+exp(-2x+14))

2.3.4. Bass curve shape with different p and q

In this section, we show the shape of the n(t) - t curve for different values of p and q. When q > p, the peak time t* > 0, which means the product is successful - the influence of WoM (q) is greater than the external influence (p). On the other hand, if q < p, the product is unsuccessful, since the influence of WoM is smaller than the external influence.

Figure 2.12 Shape of the n(t) curve - q > p on the left, q < p on the right - Source: Slides from Bar-Ilan University (https://facultv.biu.ac.i/~fruchtg/829/ec/6.pdf)
Figure 2.13 Shape of the n(t) curve with different values of p (innovation) and q (contagion): low innovation (0.004) and low contagion (0.001); high innovation (0.04) but low contagion (0.001); low innovation (0.004) but high contagion (0.004); high innovation (0.01) and high contagion (0.004) - Source: Lecture notes from MIT Sloan (http://www.mit.edu/hauser/Hauser%20PDFs/MIT%20Sloan%20Courseware%20NOTES/Note%20on%20Life%20Cycle%20Diffusion%20Models.pdf)

2.3.5. The Generalized Bass model

The Generalized Bass model makes it possible to fit a more complicated curve than the simple S-curve. Bass et al. (1994) gave an overview of variations of the Basic Bass model. Most generalizations can be written in the following form:

n(t) = dN(t)/dt = (p[m - N(t)] + (q/m) N(t)[m - N(t)]) x(t)

where x(t) is the current marketing effort. For example, x(t) can be price (Robinson and Lakhani, 1975) or advertising power. Solving the above differential equation with boundary condition N(0) = 0, we have

N(t) = m (1 - e^(-(X(t)-X(0))(p+q))) / (1 + (q/p) e^(-(X(t)-X(0))(p+q)))

where X(t) = ∫ x(u) du.

2.4. Integrating social networks into economic diffusion models

There are previous studies focusing on integrating social networks into economic diffusion models. Although not the same as our work, they still offer insights for our research direction.

2.4.1. Two types of Twitter topics with different predictability

Naaman et al. (2011) found by hypothesis testing that there are two types of Twitter topics with significantly different features, and therefore different predictability.
- Exogenous trends: trends originating from outside of Twitter, e.g. earthquakes or sports games
- Endogenous trends: trends originating from within Twitter, e.g. popular tweets posted by Obama

Exogenous trends are hard to predict based only on Twitter data, since these topics are raised purely by media outside of Twitter. On the other hand, endogenous trends are relatively easy to predict based on Twitter data. Later in this thesis, we will show our topic clustering results, which appear consistent with Naaman's conclusion.

2.4.2. Predicting Twitter hashtag popularity by classification

Ma et al. (2013) predicted which Twitter topics will become popular one or two days later. Twitter topics are annotated by hashtags. Ma formulated the problem as a classification task, defining five ranges of peak-day count - [0, δ), [δ, 2δ), [2δ, 4δ), [4δ, 8δ], and [8δ, +∞) - and referring to these as "not popular," "marginally popular," "popular," "very popular," and "extremely popular," respectively. Based on this definition of popularity, Ma applied five classification models - naive Bayes, k-nearest neighbors, decision trees, support vector machines, and logistic regression. The metric Ma used to evaluate prediction accuracy is the F1 statistic

F1 = 2 × (precision × recall) / (precision + recall)

where precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, and recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. Table 2.2 lists the features Ma used for classification, including 7 content features extracted from the hashtag string and the collection of tweets containing the hashtag, and 11 contextual features extracted from the social graph formed by users who have adopted the hashtag. Ma observed that contextual features are more effective than content features.
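The F1 evaluation used by Ma can be sketched with a small self-contained function; the labels below are toy values for illustration, not data from the study.

```python
def f1_score(y_true, y_pred, positive):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # retrieved that are relevant
    recall = tp / (tp + fn) if tp + fn else 0.0      # relevant that are retrieved
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: predicting whether a hashtag becomes "popular"
truth = ["popular", "popular", "not", "not", "popular"]
pred  = ["popular", "not",     "not", "popular", "popular"]
```

Here precision and recall are both 2/3, so F1 is 2/3; as the harmonic mean of the two, F1 penalizes classifiers that trade one off heavily against the other.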
Figure 2.14 Precision and Recall (source: Wikipedia)

Table 2.2 Features used in Ma et al. (2013) - 7 content features and 11 contextual features

Content features:
- ContainingDigits: binary attribute checking whether a hashtag contains digits
- SegWordNum: number of segment words from a hashtag
- URLFrac: fraction of tweets containing a URL
- SentimentVector: 3-dimension vector: ratio of neutral, positive, and negative tweets
- TopicVector: 20-dimension topic distribution vector derived using a topic model
- HashtagClarity: Kullback-Leibler divergence of word distribution between tweets containing the hashtag and the whole tweet collection
- SegWordClarity: Kullback-Leibler divergence of word distribution between tweets containing any segment word of the hashtag and the whole tweet collection

Contextual features:
- UserCount: number of users
- TweetsNum: number of tweets
- ReplyFrac: fraction of tweets containing a mention (@)
- RetweetFrac: fraction of tweets containing RT
- AveAuthority: average authority of users in the adoption graph
- TriangleFrac: fraction of users forming triangles in the adoption graph
- GraphDensity: density of the adoption graph
- ComponentRatio: ratio between the number of connected components and the number of nodes
- AveEdgeStrength: average edge weight in the adoption graph
- BorderUserCount: number of border users
- ExposureVector: 15-dimension vector of exposure probabilities P(k)

2.4.3. Applying Diffusion Models to Predict the User Count of Social Network Websites

The literature discussed in this section applied diffusion models to a whole social network website, instead of a certain topic. Chang (2010) suggested applying diffusion models to study Twitter hashtag adoption, and later Wang (2011) completed the research, applying the Bass model to the Twitter diffusion process. Predicted results are shown in Figure 2.15.
Figure 2.15 Twitter Cumulative Adoptions: 5 Years since Introduction - True (left) vs Fitted (right) - Wang (2011)

In addition to Twitter cumulative adoptions, Bauckhage et al. (2014) investigated patterns of adoption of 175 social media services and web businesses, and predicted Google search frequencies for these social media websites. From Google Trends, Bauckhage collected the search frequency of queries such as "Twitter," "eBay," "Facebook," or "YouTube" that indicate interest in social media services. The models Bauckhage applied were several economic diffusion models, including the Gompertz model, the Bass model, and the Weibull model. Results showed that the diffusion models provide accurate and statistically significant fits to the data, with the Gompertz model performing best among the three, and that collective attention to social media grows and subsides in a highly regular manner. Figure 2.16 shows the fitted and predicted curves from the three diffusion models (Weibull, Bass, and shifted Gompertz) for the keywords "YouTube" and "Twitter." Solid lines fit the values reported by Google Trends (true values), and dotted lines predict the future Google search frequency for "YouTube" and "Twitter."

Figure 2.16 Predicting future Google search frequency for "YouTube" (left) and "Twitter" (right) - Bauckhage et al. (2014)

2.4.4. Predicting hashtag/keyword popularity when popularity is defined continuously

Different from the models discussed in Section 2.4.2, when popularity is defined continuously, classification models no longer apply.
Wang (2011) defined the popularity of a hashtag as its percentage among Twitter trending topics, denoted by hashtags such as "#JeremyLin" (or by keywords such as "Jeremy Lin"). Wang then forecasted this percentage using the Simple Logistic Growth Model

y(t) = L / (1 + a e^(-bt))

where y(t) is the market share (or adoption rate), L refers to the saturation level, and a and b describe the curve. Hashtag popularity can also be defined as the online mention count of a Twitter hashtag or a news keyword, as proposed by Yang and Leskovec (2010). They collected Twitter hashtags, and phrases in blogs and news, from the five types of data sources shown in Table 2.3. Using the Linear Influence Model, Yang and Leskovec predicted the mention count of a certain hashtag in the next day. The Linear Influence Model models the volume of diffusion over time as a sum of influences of nodes that got "infected" beforehand. For each node (u, v, or w in Figure 2.17), an influence function is estimated and used to quantify how many subsequent infections can be attributed to the influence of that node over time. The predicted mention count in the next time interval is the sum of all influence functions. An influence function could be an exponential function, I_u(t) = c_u e^(-λ_u t). Yang and Leskovec also investigated what the influence function looks like across different fields (e.g. politics, entertainment, business, sports), as shown in Figure 2.18, and found that the patterns of influence of individual participants differ significantly depending on the type of the node and the topic of the information.
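The Linear Influence Model's volume prediction can be sketched as follows. This is a minimal illustration with exponential influence functions; the node names, peak influences c_u, and decay rates λ_u are made-up values, not estimates from Yang and Leskovec's data.

```python
import math

def predicted_volume(t, infections, c, lam):
    """Linear Influence Model sketch: the volume at time t is the sum of
    influence functions I_u(t - t_u) = c_u * exp(-lam_u * (t - t_u))
    over all nodes u that were 'infected' at some time t_u <= t."""
    total = 0.0
    for u, t_u in infections.items():
        if t_u <= t:
            total += c[u] * math.exp(-lam[u] * (t - t_u))
    return total

# Three nodes mention a hashtag at hours 0, 1, and 2 (illustrative values)
infections = {"u": 0.0, "v": 1.0, "w": 2.0}
c = {"u": 10.0, "v": 5.0, "w": 8.0}    # peak influence per node
lam = {"u": 0.5, "v": 0.5, "w": 1.0}   # decay rate per node
```

In the actual model the influence functions are estimated from observed volume by least squares rather than assumed; the sketch only shows how, once estimated, they combine additively into the predicted mention count.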
Table 2.3 Five Types of Data Sources in Yang and Leskovec (2010)

Type             | Websites
Newspaper        | nytimes.com, online.wsj.com, washingtonpost.com, usatoday.com, boston.com
Professional blog | huffingtonpost.com, salon.com
TV               | cbs.com, abc.com
News Agency      | reuters.com, ap.org
Blogs            | wikio.com, forum.prisonplanet.com, blog.taragana.com, freerepublic.com, gather.com, blog.myspace.com, leftword.blogdig.net, bulletin.aarp.org, forums.hannity.com, wikio.co.uk, instablogs.com

Figure 2.17 The Linear Influence Model: the volume of diffusion over time is modeled as a sum of influences of nodes that got "infected" beforehand - Yang and Leskovec (2010)

Figure 2.18 Average influence functions of five types of websites - Newspapers (News), Professional Blogs (PB), Television (TV), News Agencies (Agency), and Personal Blogs (Blog) - across different fields: (a) Politics, (b) Nation, (c) Entertainment, (d) Business, (e) Technology, (f) Sports. The number in brackets denotes the total influence of a media type. - Yang and Leskovec (2010)

3.
Data Overview

In this section, we give an overview of the data used in this research, including the data source, data size, raw data format, and the data processing necessary before modeling.

3.1. Definitions

The concepts used in this research are listed below:
- Topic: specified by a hashtag (e.g. #WorldCup) or keywords (e.g. Jeremy Lin).
- Tweet: text posted on Twitter by its users. A tweet is either a root tweet, a retweet, or a reply.
- Mention count: number of tweets mentioning the topic, i.e. including the hashtag or keywords in their text.
- Root tweet: an original tweet created and posted by a user.
- Retweet: a forwarded tweet, not originally created by the user.
- Reply: a reply under a tweet posted by another user.
- Peak day: the day on which the topic "exploded" on Twitter, with the highest daily mention count over the time horizon.
- Peak count: the mention count on the peak day, for a certain topic.
- Influencer (Influential User): each Twitter user has an influence score calculated by Topsy. It measures the likelihood that, each time the user says something, people will pay attention. Influence for Twitter users is computed using all historical retweets. "Influential" tags appear for the top 0.5% most influential Twitter users.
- Adopters (of a certain topic): users who have already tweeted about the topic.

3.2. Data source - Topsy API

Data in this thesis were collected using the Topsy API (Application Programming Interface). Topsy (http://topsy.com/) maintains all tweets since Twitter's inception in 2006. In addition to tweets as text, Topsy transforms the raw tweets and user information into numbers that are easier to use, such as the number of tweets mentioning a certain keyword in a certain period of time, or the number of influential authors who talk about a given term. Figure 3.1 shows the homepage of Topsy.com, where you can type in the keywords you are interested in, e.g. "iWatch." The tweets per day on the keywords are then shown as a graph (Figure 3.2).
It is even possible to compare tweets on different keywords on the same graph (Figure 3.3).

Figure 3.1 Topsy Homepage

Figure 3.2 An Example of Using Topsy - Tweets on "iWatch" per day

Figure 3.3 Tweet Count on "iWatch" vs. "iPhone 6S"

Table 3.1 Topsy APIs (Source: http://api.topsy.com/doc/)

Content APIs:
- Tweets: top tweets for a set of terms and filters
- Bulk Tweets: tweets in bulk for a set of terms and filters
- Streaming Tweets: stream of tweets matching a set of terms and filters
- Photos: top photos for a set of terms and filters
- Links: top links for a set of terms and filters
- Videos: top videos for a set of terms and filters
- Citations: time-ordered tweets or retweets referencing a tweet, link, photo, or video
- Conversation: reply thread for a tweet
- Tweet: look up tweets by tweet ID
- Validate: check whether a tweet is still valid (has not been deleted by the user) by tweet ID
- Location: set of places that start with a given string, to be used with /metrics/geo

Metrics APIs:
- Mentions: number of tweet mentions by time slice for any term
- Citations: number of total citations (tweets, retweets, and replies) for a particular URL
- Impressions: number of potential impressions by time slice for any term
- Sentiment: Topsy Sentiment Score (0-100) by time slice for any term
- Geo Distribution: number of mentions by country, state/province, county, or city

Insights APIs:
- Related Terms: phrases, hashtags, terms, or authors co-mentioned with a given term
- Influencers: influential authors who talk about and amplify a given term
- Author Info: information about a Twitter account

Data can not only be viewed on Topsy's website, but also collected in batch via the Topsy API. Table 3.1 lists the information that can be obtained through the Topsy API. In our research, we applied one of the above three API categories - the Metrics API - to collect the tweet count on a certain topic every hour.
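A query against the Mentions endpoint of the Metrics API might be built as in the sketch below. Topsy has since been shut down, and the exact endpoint path, version segment, and `q`/`apikey` parameter names here are assumptions for illustration; only `mintime` and `maxtime` come from Topsy's documented filter parameters.

```python
from urllib.parse import urlencode

def topsy_mentions_url(term, apikey, mintime, maxtime):
    """Build a (hypothetical) Topsy Metrics/Mentions query URL for hourly
    mention counts of a term between two Unix timestamps."""
    params = {
        "q": term,            # assumed name for the search-term parameter
        "apikey": apikey,     # assumed name for the API-key parameter
        "mintime": mintime,   # Unix timestamp, per Topsy's filter parameters
        "maxtime": maxtime,   # Unix timestamp, per Topsy's filter parameters
    }
    # Path below is an assumption modeled on the Metrics API category
    return "http://api.topsy.com/v2/metrics/mentions.json?" + urlencode(params)

url = topsy_mentions_url("iWatch", "YOUR_API_KEY", 1399917600, 1415466000)
```

The timestamps in the example correspond to the first and last hours of the iWatch data shown later in Table 3.4.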
On the basis of these APIs, Topsy also provides filter parameters (Table 3.2), which make it possible to collect tweets by time, location, language, or tweet type (root tweet, retweet, or reply).

Table 3.2 Filter parameters of the Topsy API (Source: http://api.topsy.com/doc/filter-parameters/)

- mintime: start time for the report, in Unix timestamp format.
- maxtime: end time for the report, in Unix timestamp format.
- region: show results for the specified locations only. A valid region integer ID must be used.
- latlong: show results only from tweets that are geotagged with latitude/longitude coordinates.
- allow_lang: show results in the specified language only. Currently supports 'en' (English), 'zh' (Chinese), 'ja' (Japanese), 'ko' (Korean), 'ru' (Russian), 'es' (Spanish), 'fr' (French), 'de' (German), 'pt' (Portuguese), and 'tr' (Turkish).
- sentiment: show results with the specified sentiment only. Valid values are 'pos' (positive), 'neu' (neutral), or 'neg' (negative).
- infonly: show results from influential users only.
- tweet_types: show results only of the specified tweet type. Supported values are 'tweet', 'reply', and 'retweet'.

3.3. Data size - number of topics and time horizon of each topic

We collected data on 220 topics covering a wide range of fields, from 2009 to 2014. The list of topics is shown in Appendix A, with the peak day and peak count of each topic. Twitter has published its trending (popular) topics every year on its websites, e.g.
- 2012 Twitter trending topics: https://2012.twitter.com/en/trends.html
- 2013 Twitter trending topics: https://2013.twitter.com/#category-2013
- 2014 Twitter trending topics: https://2014.twitter.com/moments

For each topic, there is a peak day with the highest daily mention count. The time horizon of each topic is 180 days, about 6 months, from 120 days before the peak day to 60 days after the peak day. For example, Figure 3.4 shows the time horizon of Topic 89 - iWatch.
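Counting the window inclusively, the peak is the 120th of the 180 days, so the window starts 119 calendar days before the peak. A minimal sketch, which reproduces the iWatch dates used as the example topic:

```python
from datetime import date, timedelta

def topic_window(peak_day):
    """180-day horizon around the peak: the peak is the 120th day of the
    window, so it runs from 119 days before to 60 days after the peak."""
    start = peak_day - timedelta(days=119)
    end = peak_day + timedelta(days=60)
    return start, end

# Topic 89 (iWatch): a peak on Sep 9, 2014 gives May 13 - Nov 8, 2014
start, end = topic_window(date(2014, 9, 9))
```

The inclusive span `(end - start).days + 1` is exactly 180, matching the stated horizon length.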
We first detected that the mention count of "iWatch" reached a peak on Sep 9th, 2014; the time horizon for this topic was then determined as May 13th to Nov 8th, 2014.

Figure 3.4 Time Horizon and Mention Count of Topic 89 - iWatch (peak day 2014/9/9; non-cumulative and cumulative mention count)

3.4. Raw data and its format

There are two parts to the raw data collected via the Topsy API. The first part, with a time unit of one hour, includes the variables in Table 3.3 for each topic. Table 3.4 shows the part-1 data for topic 89, which contains the 7 variables for every hour for this topic. The second part of the raw data was collected for every 5 days, and consists of three sheets, shown in Table 3.5, Table 3.6 and Table 3.7. In each of the three sheets, each row represents a topic, and each column represents a variable for each five-day interval. There are 180 days / 5 days = 36 columns in each sheet. For example, in Table 3.5, the number in the 1st row, 1st column is the count of influencers mentioning topic 1 in days 1~5.

Table 3.3 Raw Data Part 1 (collected by hour) - Variables for each topic

- Unix timestamp: denotes the start of the hour.
- Mention count: mention count of a topic per hour; equals the sum of the root tweet, retweet, and reply counts.
- Root tweet count
- Retweet count
- Reply count
- Mention count by influential users only: each Twitter user has an influence score calculated by Topsy, measuring the likelihood that, each time the user says something, people will pay attention. Influence is computed using all historical retweets; "Influential" tags appear for the top 0.5% most influential Twitter users.
- Sentiment score: ranging from 0 to 100, where 0 means negative and 100 means positive; determined by the Topsy API.
- Impressions: number of users reached by this topic (these users have Twitter feeds including this topic).
Table 3.4 Raw Data Part 1 - Topic 89 (iWatch) as an Example

Unix timestamp  Mention count  Root tweet count  Retweet count  Reply count  Mention count by influential users only  Sentiment score  Impressions
1399917600      38             32                3              3            1                                        60               14181
1399921200      42             36                5              1            1                                        92               44378
1399924800      25             18                5              2            1                                        76               20994
...
1415462400      31             25                6              0            3                                        70               22729
1415466000      7              6                 0              1            0                                        32               7030

Table 3.5 Raw Data Part 2 - Sheet A: Count of influencers mentioning the topic

Topic ID  Day 1~5  Day 6~10  ...  Day 171~175  Day 176~180
1         1000     1000      ...  997          999
2         0        0         ...  999          999
3         999      1000      ...  1000         998
...
219       485      284       ...  332          268
220       909      1000      ...  1000         1000

Table 3.6 Raw Data Part 2 - Sheet B: Average follower count of influencers mentioning the certain topic

Topic ID  Day 1~5   Day 6~10  ...  Day 171~175  Day 176~180
1         6290.815  15039.53  ...  26423.54     1379.78
2         .         .         ...  6383.804     3913.21
3         4565.331  2849.977  ...  5527.489     22899.03
...
219       21519.15  1413.029  ...  31357.66     17697.21
220       21519.15  1413.029  ...  31357.66     17697.21

Table 3.7 Raw Data Part 2 - Sheet C: Sum of the influential levels of all influencers

Topic ID  Day 1~5  Day 6~10  ...  Day 171~175  Day 176~180
1         7250     7503      ...  7289         7670
2         0        0         ...  1356         1398
3         5438     5213      ...  6405         6313
...
219       448      424       ...  875          903
220       611      910       ...  792          548

3.5. Data processing before modeling

There are missing-data problems in the 2nd part of the raw data, shown in the three tables above with a dot ".". For example, in Table 3.6 (Sheet B), Topic 2 has no information about follower count in the first 10 days. We simply treated these missing values as 0 in our model, which makes sense because in this case missing data means there were no influencers talking about Topic 2 in those days, and therefore the follower count is 0.

The next step of data processing is to create features (independent variables) and dependent variables for machine learning, which will be introduced in Section 4.1.

3.6.
Other data sources providing Twitter data

This section introduces other data sources providing Twitter data that readers might be interested in, although they were not used in this research. Similar to Topsy, Keyhole (http://keyhole.co/) is another website on which we can search tweets on certain topics, broken down by location, users' gender, and many other filters. Figure 3.5 is the page returned by Keyhole for the keyword "iWatch." Twitter itself has an API (https://dev.twitter.com/overview/documentation), offering information related to its users, tweets, hashtags, and locations.

Figure 3.5 Keyhole's Results for "iWatch"

4. Models

In this section, we will introduce the two types of models applied to predict Twitter topic popularity - machine learning regression models and the Basic Bass model.

4.1. Dependent and Independent Variables in Machine Learning Models

The dependent variable in our models is the total mention count within 180 days, from 120 days before the peak day until 60 days after the peak day. Consistent with the notation in the Bass model introduction (Section 2.3), let N(180) denote the dependent variable. Figure 4.1 shows the wide distribution of the total mention count; 97 topics have a total mention count between 10^5 and 10^6, which is the mode range.

Figure 4.1 Distribution of the total mention count (220 topics)

The task of our models is to predict the dependent variable based on independent variables (features) calculated from data within the first x days (0 < x <= 180). We have 12 features, and we use f_i(x) to denote the i-th feature calculated from data within the first x days. The meaning of each feature is shown in Table 4.1. For example, f_1(15) is the average mention count per hour within the first 15 days. Note that all features are normalized to [0, 1] before model estimation.
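The thesis does not spell out the scaling formula; a common choice consistent with "normalized to [0, 1]" is per-feature min-max scaling. A minimal sketch (the function name is ours):

```python
def minmax_normalize(columns):
    """Scale each feature column to [0, 1] by min-max scaling.

    `columns` is a list of feature columns (one list of values per feature);
    a constant column is mapped to all zeros to avoid division by zero.
    """
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return scaled

# e.g. a toy f_1 column (average mention count per hour) across 4 topics
f1 = [2.0, 10.0, 6.0, 4.0]
print(minmax_normalize([f1])[0])  # [0.0, 1.0, 0.5, 0.25]
```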
In our models, x can take any value in the set Firstdays = {15, 30, 45, 60, 75, 90, 100, 110, 120, 130, 140, 150, 165, 180}. Intuitively, larger x leads to better prediction because the model knows more.

We have two ways of using the 12 features. Take the 1st feature, average mention count per hour, as an example, with x = 45 days:
- Not merged: use only the average mention count per hour in the first 45 days (f_1(45)). Then there are 12 features in the model.
- Merged: use the average mention count per hour in the first 15 days, 30 days, and 45 days (f_1(15), f_1(30), and f_1(45)). Then there are 12*3 = 36 features in one model; that is, all features before x are merged.

Table 4.1 Independent variables (features)

f_1(x)   Average mention count per hour
f_2(x)   Growth rate of mention count: OLS slope of the mention-time line
f_3(x)   % of root tweets = root tweets in first x days / mention count in first x days
f_4(x)   % of retweets
f_5(x)   Average mention count per hour by influential users only
f_6(x)   Average impressions per hour
f_7(x)   Average sentiment score
f_8(x)   Extreme degree of sentiment = avg(|sentiment score(t) - 50|)
f_9(x)   Growth rate of sentiment score: OLS slope of the sentiment-time line
f_10(x)  Average influencer count (max limit = 1000) per 5 days
f_11(x)  Average follower count per influencer
f_12(x)  Sum of influencers' influence levels

4.2. Metrics used to evaluate models

To evaluate the forecasting accuracy of our models, we have two choices of metrics:
1) Mean relative error = mean of |(Forecast - True)/True|
2) Mean squared error = mean of (Forecast - True)^2
Considering the wide distribution of the total mention count and the huge differences between mention counts of different topics, relative error is preferred, while MSE is too sensitive to topics with large mention counts. So we define the predicting error as the mean relative error, |(Forecast - True)/True| averaged over all topics.

4.3.
Machine Learning Regression Models

The 220 topics are separated into a training set of 184 topics and a testing set of 36 topics; roughly 1/6 of the data are used for testing. In this section, we introduce the four machine learning models used to predict the total mention count of a Twitter topic: k-nearest-neighbor, linear regression, bagged regression tree, and ensemble.

4.3.1. K-nearest-neighbor with feature selection

The first machine learning model is k-nearest-neighbor (KNN). We applied 4 variations of KNN, shown in Table 4.2, all of which apply the following method: for each x, search for the best combination of K and feature subset. The 4 KNN variations use features in the two ways of Section 4.1, and use two model templates to forecast a testing point, given the K nearest neighbor points:
- Template 1: KNN - average of all K neighbors' total mention counts
- Template 2: Weighted KNN - neighbors' total mention counts weighted by 1/distance

Table 4.2 Variations of KNN

Model ID  Machine Learning Model Template  The way of using features
1         KNN                              Not merged
2         Weighted KNN                     Not merged
3         KNN                              Merged
4         Weighted KNN                     Merged

We traverse K from 1 to the rule-of-thumb value (sqrt(training set size) ≈ 14). For each K, we search for the optimal feature subset and calculate the corresponding relative error; finally we pick the K and the corresponding feature subset with the smallest relative error. Feature selection was implemented by the sequential feature selection algorithm (Matlab function "sequentialfs"), in order to avoid overfitting with too many features. Sequential feature selection has two components:
- An objective function, called the criterion, which the method seeks to minimize over all feasible feature subsets. In our case, the criterion is the relative error averaged over topics. For each candidate feature subset, sequential feature selection performs a 5-fold cross-validation to calculate the criterion.
- A sequential search algorithm, which adds or removes features from a candidate subset while evaluating the criterion. Since an exhaustive comparison of the criterion value at all 2^n subsets of an n-feature data set is typically infeasible, sequential searches move in only one direction, always growing or always shrinking the candidate set.

4.3.2. Linear regression with feature selection

The 2nd machine learning model is linear regression with feature selection. Table 4.3 shows the two variations of linear regression models.

Table 4.3 Variations of Linear Regression models

Model ID  Machine Learning Model Template  The way of using features
5         Linear regression                Not merged
6         Linear regression                Merged

To avoid overfitting with too many features, feature selection was implemented by the stepwise feature selection algorithm, adding a feature when its coefficient has a p-value below 0.05, and removing a feature when its coefficient has a p-value above 0.10. At each step, at most one feature can be added, and at most one feature can be removed.

4.3.3. Bagged regression tree

The 3rd machine learning model is the bagged regression tree. We have two variations of tree models, shown in Table 4.4. Bagging is applied to obtain a more robust and stable tree model. Bagging, also called "bootstrap aggregation," is a type of ensemble learning. The bagging algorithm generates many bootstrap replicas of the dataset and grows a decision tree on each replica. Each bootstrap replica is obtained by randomly selecting N observations out of N with replacement, where N is the dataset size. To find the predicted response of a bagged tree, take an average over the predictions from the individual trees. In our case, we bagged 10 weak learners (regression trees).

Table 4.4 Variations of Tree models

Model ID  Machine Learning Model Template  The way of using features
7         Bagged tree                      Not merged
8         Bagged tree                      Merged

4.3.4.
Ensemble model

Different models outperform at different time stages, as will be shown in Section 5. To utilize the advantages of all models, and to mitigate error peaks and fluctuations, the last machine learning model - the ensemble model - takes an average over the KNN average, LR average, and Tree average as its forecast value, where
- KNN average = average forecast value over all 4 KNN variations
- LR average = average forecast value over the 2 LR models
- Tree average = average forecast value over the 2 tree models

4.4. The Basic Bass model

4.4.1. Model formulation

Section 2.3 introduced the idea and form of the Bass model and its extensions. With our Twitter data, we will fit the Basic Bass model, specifically the following equation:

N(t) = m * (1 - e^{-(p+q)t}) / (1 + (q/p) * e^{-(p+q)t})

where
- N(t): cumulative number of tweets mentioning the certain topic up to time t
- m = N(∞): total number of tweets when time goes to infinity
- p: the coefficient of innovation, or the coefficient of external influence, denoting the effect of mass media (outside Twitter) on potential adopters
- q: the coefficient of imitation, or the coefficient of internal influence, denoting the effect of the existing tweets on the certain topic

Take x = 45 as an example: at the time of day 45, we have a list of (t, N(t)) pairs for t <= 45. This list is used to fit the N(t)-t curve to get the estimated m, p, and q. Figure 4.2 takes Topic 4 as an example, showing the true (solid line) and fitted (dashed line) curves of N(t), the cumulative mention count.

Figure 4.2 True vs fitted cumulative mention count curve (Topic 4 - #ChiefKeefMakesMusicFor - 180 days)

4.4.2. Three time aggregation levels - Fitting error and Forecasting error

We tried three time aggregation levels - mention count is aggregated to count per hour, count per day, and count per 5 days. The aggregation level is denoted by intv, and takes three values: 1hr (1/24 day), 1 day, 5 days.
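The fitting step above can be sketched as follows. The thesis fits m, p, and q by minimizing squared error (presumably with a numerical solver in Matlab); as a self-contained illustration, the sketch below grid-searches p and q and solves for the best m in closed form, recovering known parameters from a synthetic Bass curve. The function names and grids are ours:

```python
import math

def bass_shape(t, p, q):
    """Normalized Bass cumulative curve N(t)/m."""
    e = math.exp(-(p + q) * t)
    return (1.0 - e) / (1.0 + (q / p) * e)

def fit_bass(ts, N, p_grid, q_grid):
    """Grid-search p and q; for each pair, the optimal m has a closed form
    (least-squares fit of N against the normalized curve)."""
    best = None
    for p in p_grid:
        for q in q_grid:
            s = [bass_shape(t, p, q) for t in ts]
            m = sum(n * si for n, si in zip(N, s)) / sum(si * si for si in s)
            err = sum((n - m * si) ** 2 for n, si in zip(N, s))
            if best is None or err < best[0]:
                best = (err, m, p, q)
    return best[1:]

# Synthetic topic with known parameters; the search recovers them exactly.
ts = list(range(1, 181))
N = [8000 * bass_shape(t, 0.002, 0.08) for t in ts]
m_hat, p_hat, q_hat = fit_bass(ts, N,
                               p_grid=[0.001, 0.002, 0.005, 0.01],
                               q_grid=[0.02, 0.05, 0.08, 0.10])
```

In practice a continuous optimizer (e.g. nonlinear least squares) would replace the grid, but the closed-form step for m is a useful trick either way, since m enters the model linearly.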
Take the aggregation level of 1 day as an example: we feed the model with (t, N(t)) pairs for t = 1 day, 2 days, ..., 180 days. With all data (x = 180 days) fed to the model, a smaller aggregation level means more points on the curve are fed to the model, and therefore leads to better fitting accuracy (Figure 4.3), where the fitting error (MSE) is defined as

fitting error = (1 / (180/intv)) * Σ_{t = intv, 2·intv, ..., 180} (N(t) - N̂(t))^2

When x < 180 days, during the fitting process for the optimal m, p, and q, the objective function to minimize is

fitting error = (1 / (x/intv)) * Σ_{t = intv, 2·intv, ..., x} (N(t) - N̂(t))^2

Figure 4.3 Compare fitted curves with aggregation levels of 1hr, 1 day, and 5 days (Topic 4 - #ChiefKeefMakesMusicFor - 180 days)

Note that our task is not fitting, but predicting the total mention count within 180 days based on data within the first x days. Recall that the predicting error is defined as the mean relative error, |(Forecast - True)/True| averaged over all topics; the fitting error only makes sense when x = 180 days. When x < 180, say x = 90, the Bass model fits the first half of the N(t) curve. Although smaller granularity leads to smaller fitting error for the first half of the curve, when the curve is extended to x = 180, the predicting error might be bigger. Smaller granularity leads to violent fluctuation, and when the model tries to fit every point on the fluctuating curve and extend the curve to day 180, the resulting estimate of the total mention count might be far from the true value. Later, in Section 5.2.2, we will compare the predicting errors of the three time aggregation levels.

4.4.3. Define the predicted cumulative mention count

The true value of our dependent variable is N(180), the cumulative mention count within 180 days. With the fitted parameters m̂, p̂, and q̂, we can predict the cumulative mention count.
Define the fitted curve as

N̂(t) = m̂ * (1 - e^{-(p̂+q̂)t}) / (1 + (q̂/p̂) * e^{-(p̂+q̂)t})

We have two choices, N̂(180) or m̂, as the predicted cumulative mention count within 180 days. m̂ (an estimate of N(∞)) is the total mention count when time goes to infinity; if we assume the horizon of 180 days covers the whole diffusion process, then m̂ can serve as a predicted value for the total mention count. Later, in Section 5.2.1, we will discuss which of N̂(180) or m̂ is the better predicted value.

4.5. Modeling on a subset of topics

Besides running our models on all 220 topics, we will also model on a subset of topics, since some topics are exogenous trends (Section 2.4.1) originating from outside of Twitter, e.g., an earthquake. These topics are fundamentally unpredictable with Twitter data alone. Figure 4.4 plots the non-cumulative mention count every 5 days for Topic 26 and Topic 89. The peak (day 120) of Topic 26, "pandaAi," was generated by news about a panda named Ai faking pregnancy. There were no tweets on this topic before the news release, and therefore its total mention count cannot be forecasted without information from outside Twitter. Topic 89, "iWatch," was announced by Apple's CEO Tim Cook on September 9, 2014, which is consistent with the peak day for this topic on Twitter. But before the peak day, there were already "rumors" about the release of the iWatch, explaining the fluctuation in the mention count curve at early stages. Before the peak day, the mention count of Topic 26 seems harder to predict than that of Topic 89. This motivates us to find the subset of topics that are easier to predict. In this section, we will try to detect which topics are exogenous trends; then, in Section 5.4, we will model on endogenous trends only, and see if we can improve the predictive accuracy.
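The clustering described next (Section 4.5.1) operates on each topic's 5-day mention counts normalized into percentages of its 180-day total, so that topics of very different sizes become comparable. A minimal sketch of that normalization (the function name is ours):

```python
def diffusion_pattern(counts_5day):
    """Convert a topic's 36 five-day mention counts into fractions of the
    180-day total; an all-zero topic maps to all zeros."""
    total = sum(counts_5day)
    if total == 0:
        return [0.0] * len(counts_5day)
    return [c / total for c in counts_5day]

# toy example: a topic whose mentions are concentrated around one window
pattern = diffusion_pattern([10, 20, 50, 20])
print(pattern)  # [0.1, 0.2, 0.5, 0.2]
```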
Figure 4.4 Non-cumulative mention count of topics with/without fluctuation before peak day (Topic 26 - #pandaAi; Topic 89 - iWatch, peak day September 9, 2014)

4.5.1. Determine the number of clusters

We will separate the group of exogenous trends and the group of endogenous trends by clustering. Intuitively, exogenous trends have less fluctuation before the peak day (Topic 26 in Figure 4.4), and endogenous trends have more (Topic 89 in Figure 4.4). So the clustering will be applied to the vector of mention counts every 5 days, which represents the diffusion pattern of each topic, shown in Table 4.5. Each row represents a topic (220 rows), and each column represents the mention count within a certain 5-day window as a percentage of total mentions in 180 days (180/5 = 36 columns).

Table 4.5 Percentage mention count every 5 days (vector being clustered)

Time range  Day 1~5  Day 6~10  Day 11~15  ...  Day 166~170  Day 171~175  Day 176~180
Topic 1     0.534%   0.153%    0.109%     ...  1.897%       0.894%       1.199%
Topic 2     0.498%   0.299%    0.149%     ...  11.211%      9.367%       5.182%
Topic 3     0.000%   0.000%    0.000%     ...  0.000%       0.000%       0.000%
Topic 4     0.000%   0.001%    0.000%     ...  0.193%       0.120%       0.249%
...
Topic 220   0.082%   0.097%    0.375%     ...  0.084%       0.110%       0.113%

To detect the natural underlying number of clusters (number of diffusion patterns), we apply 5 different methods to determine how many clusters we should have.

The 1st method is applying Principal Component Analysis to reduce the dimension of the percentage mention count vector from 36-dim to 2-dim or 3-dim, for visualization. In Table 4.6, the principal component variances are the eigenvalues of the covariance matrix of the 36-dim mention count vector.
Dividing each variance by the sum of all 36 variances, we get that
- the first 2 principal components capture 83.1% of the total variance;
- the first 3: 87.9%;
- the first 6: 94.6%;
- the first 10: 97.1%;
- the first 18: 99.2%.
The first 2 and first 3 principal components are plotted in Figure 4.5 and Figure 4.6. Visually, they do not give much insight into the underlying number of clusters.

The 2nd method is the Elbow method, which is a heuristic: run K-means for different K, and examine the within-cluster dissimilarity (the average within-cluster point-to-centroid distance) as a function of K. At some value of K the cost drops dramatically, and after that it reaches a plateau (Figure 4.7); this is the optimal K value. The rationale behind the Elbow method is that beyond the optimal K, which is consistent with the natural clustering, each new cluster is very near some existing one. A limitation of the Elbow method is that the "elbow" cannot always be unambiguously identified; sometimes there is no elbow, or there are several elbows (Figure 4.8). With our Twitter data, we define the cost function as the average within-cluster point-to-centroid distance. Results are plotted in Figure 4.9. The optimal K is probably 2.
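The elbow computation can be sketched with a plain Lloyd's K-means and the average point-to-centroid cost; on toy data with two well-separated groups, the cost collapses between K = 1 and K = 2 and then plateaus. All names, the toy data, and the deterministic initialization are ours:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with deterministic init on the first k points."""
    cents = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, cents[i]))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                cents[i] = [sum(c) / len(g) for c in zip(*g)]
    return cents

def elbow_cost(points, k):
    """Average point-to-nearest-centroid distance (the cost in Figure 4.9)."""
    cents = kmeans(points, k)
    return sum(min(dist(p, c) for c in cents) for p in points) / len(points)

# Two well-separated toy groups: the cost collapses at K = 2, then plateaus.
pts = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
costs = {k: elbow_cost(pts, k) for k in (1, 2, 3)}
```

In the thesis the points are the 36-dim percentage vectors of Table 4.5 rather than 2-D toys, and the K-means initialization would normally be randomized.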
Table 4.6 Principal components of mention count vector

Principal Component  Variance     Principal Component  Variance
1                    0.09103      19                   0.000159
2                    0.023717     20                   0.000138
3                    0.006545     21                   0.000133
4                    0.004156     22                   0.000115
5                    0.003485     23                   0.000102
6                    0.00168      24                   7.55E-05
7                    0.001163     25                   7.33E-05
8                    0.000984     26                   6.17E-05
9                    0.000649     27                   5.23E-05
10                   0.000615     28                   4.95E-05
11                   0.000524     29                   4.45E-05
12                   0.000457     30                   3.57E-05
13                   0.000439     31                   2.75E-05
14                   0.000406     32                   2.45E-05
15                   0.000336     33                   1.95E-05
16                   0.000302     34                   1.77E-05
17                   0.000276     35                   1.33E-05
18                   0.000179     36                   1.11E-33

Figure 4.5 Plot of the First 2 Principal Components

Figure 4.6 Plot of the First 3 Principal Components

Figure 4.7 Illustration of the Elbow method

Figure 4.8 Elbow method when the "elbow" cannot be identified

Figure 4.9 Elbow method cost as a function of K, with our Twitter data

The 3rd method to determine the number of underlying clusters is information criteria, specifically Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Note that the Elbow method will always have a smaller cost function for larger K, while the information criteria overcome this problem by adding a cost penalty for large K; the smallest criterion value then corresponds to the optimal K. The method follows three steps:

Step 1: Fit a Gaussian mixture distribution with K components to the data. To avoid ill-conditioned covariance estimates (columns of the input 220*36 matrix might be linearly correlated), we reduced the dimension of the input vector from 36 to 6 by PCA.
Recall that the first 6 principal components capture 94.6% of the total variance, and are enough to represent the original 36-dim vector.

Step 2: Calculate the likelihood: Likelihood = Pr(data | Gaussian mixture distribution)

Step 3: Calculate the information criteria as follows:
- AIC = -2 * ln(likelihood) + 2 * k
- BIC = -2 * ln(likelihood) + ln(N) * k
where k is the number of model parameters and N is the number of topics. Figure 4.10 shows the information criteria as functions of K; the optimal K = 5. Note that we restricted K to be no larger than 6, because if K is too large, there will be ill-conditioned covariance estimates.

Figure 4.10 Information criteria with the first 6 principal components of mention count vector

The 4th method is the Silhouette method. Its basic idea is to compare within-cluster distances with between-cluster distances, which is equivalent to adding a penalty for large K, as the information criteria do. The greater the difference, the better the fit. The silhouette width s(i) for data point i is defined as

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where
- a(i) is the average distance between i and all other points in the cluster to which i belongs;
- b(i) is the minimum of the average distances between i and all the points in each other cluster.

The silhouette width ranges from -1 to 1. When s(i) ≈ 0, the point could be assigned to another cluster as well; when s(i) ≈ -1, the point is misclassified; when s(i) ≈ 1, the point is well clustered. A clustering can be characterized by the average silhouette width of the individual entities, and the largest average silhouette width over different K indicates the best number of clusters. Figure 4.11 shows that the optimal K = 2.
Figure 4.11 Average silhouette width with our Twitter data

The 5th method is the Gap criterion, where the gap value is defined as

Gap(K) = log(Expected W_K) - log(W_K)

where
- W_K is the within-cluster dissimilarity; and
- Expected W_K is obtained from data uniformly distributed over a rectangle containing the data.

Then the optimal K is

K* = min {K | Gap(K) >= Gap(K+1) - s_{K+1}}

where s_K is an error term proportional to the standard deviation of log(W_K) over the reference datasets. Figure 4.12 is an example of applying the Gap criterion to determine the optimal K. Visually, the left chart shows there are 2 underlying clusters; the right chart plots the gap values generated from the points in the left chart, with error bars denoting the half-width of s_K. The optimal K = 2 according to the Gap criterion, which is consistent with the left chart. Figure 4.13 plots the gap values generated from our Twitter data; according to the Gap criterion, the optimal K = 2.

Figure 4.12 An example of the Gap criterion

Figure 4.13 Gap values with our Twitter data (input reduced by PCA; clustered by Gaussian mixture model)

Summarizing all 5 methods (Table 4.7), the underlying number of clusters should range from 2 to 5, most probably 2. We also ran K-means on our Twitter data with K ranging from 2 to 5; the number of topics in each cluster is shown in Table 4.8. To ensure statistical significance, we need enough topics in each cluster, so the number of clusters should not be too large (no more than 3). In the following part of the thesis, we will focus on the results with K = 2.
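The silhouette width used in the 4th method can be computed directly from its definition. A sketch, assuming every cluster has at least two points (all names and the toy data are ours):

```python
import math

def silhouette_widths(points, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)); assumes every cluster
    has at least two points."""
    def d(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # distances from point i to every other point, grouped by cluster
        by_cluster = {c: [d(p, q) for j, q in enumerate(points)
                          if j != i and labels[j] == c] for c in clusters}
        a = sum(by_cluster[labels[i]]) / len(by_cluster[labels[i]])
        b = min(sum(ds) / len(ds) for c, ds in by_cluster.items()
                if c != labels[i])
        widths.append((b - a) / max(a, b))
    return widths

# Two tight, well-separated toy clusters: average silhouette close to 1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labs = [0, 0, 0, 1, 1, 1]
avg = sum(silhouette_widths(pts, labs)) / len(pts)
```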
Table 4.7 Determine number of clusters

Method                          Optimal number of clusters
PCA (visually)                  Not much insight
Elbow method                    2
Information criteria (AIC/BIC)  5
Silhouette method               2
Gap criterion                   2

Table 4.8 Number of topics in each cluster (k-means)

K (# of clusters)  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
2                  137        83
3                  59         85         76
4                  85         59         31         45
5                  52         28         43         31         66

4.5.2. Clustering results

Recall that the 220 topics are clustered by their vectors of percentage mention count every 5 days (the 36-dim vectors shown in Table 4.5). Let K = 2, and apply K-means clustering to the 220 topics. The percentage mention count vectors of the topics in each of the two clusters are plotted in Figure 4.14. Visually, Cluster 1, compared to Cluster 2, has more fluctuation before the peak day (day 120). So the total mention count of topics in Cluster 1 might be easier to predict at early stages, based on the information in that fluctuation. The cluster ID of each topic is shown in Appendix A.

If we take a deeper look at the characteristics of the topics in each cluster, we see that Cluster 1 contains topics related to quality issues, new product releases, and periodic events, such as "Toyota recall," "GM switch recall," "False beef," "iWatch," and "Australian Open." For such topics, there should be signals many days before the peak day. Cluster 2, on the contrary, contains topics that are hard to predict, e.g., "Hurricane Earl," "Thatcher death," and "MH370." However, these characteristics are not strict or rigorous; for example, Cluster 1 also contains "Government shutdown" and "Typhoon in Philippine," which seem hard to predict, while Cluster 2 contains "NBA Finals," which seems easy to predict.

Figure 4.14 Percentage mention count every 5 days of topics in each cluster (K=2)

4.5.3.
Filtering topics by the total mention count before peak day

The objective of clustering and modeling on each cluster is to find out whether there is a subset of topics with better predictability. Other than clustering, we can also fit the Bass model on those topics with more mention count before the peak day, which might lead to better prediction accuracy. In Section 5.4, we will evaluate the Bass model's predictive accuracy on
- all 220 topics;
- topics in Cluster 1 (137 topics);
- topics with cumulative mention count of at least 10 before the peak day (196 topics);
- topics with cumulative mention count of at least 1000 before the peak day (158 topics).

4.6. Notations

Here is a summary of the notations in our models.
- x: number of days fed to the model and used for predicting the total mention count, 0 < x <= 180.
- intv: time aggregation level - mention count is aggregated to count per hour, count per day, or count per 5 days; intv takes three values: 1hr (1/24 day), 1 day, 5 days.
- Fitting error: mean squared error between the true and fitted cumulative mention count with 100% of the data (x = 180 days):
  fitting error = (1 / (180/intv)) * Σ_{t = intv, 2·intv, ..., 180} (N(t) - N̂(t))^2
- Predicting error: the mean relative error, |(Forecast - True)/True|, averaged over all 220 topics.
- N(t), N̂(t): true and estimated cumulative mention count of a certain topic by time t.
- m, p, q: true values of the Bass model parameters.
- m̂, p̂, q̂: estimated m, p, and q based on data within the first x days.

5. Results - Predicting Cumulative Mention Count

In this section, we show the predictive accuracy of the machine learning models and the Bass diffusion model. Recall that the task of our models is to predict the total mention count of a topic within 180 days, and ideally the predicting error will drop to near zero before the peak day (day 120). Also recall that the predicting error is defined as the relative error between the true and forecasted total mention count.

5.1.
Machine learning models results

We have four types of models:
- K-nearest-neighbor (KNN) with feature selection
- Linear regression with stepwise feature selection
- Bagged regression tree
- Ensemble model
With variations, there are 8 machine learning models in all (Table 5.1). Details about the machine learning models can be found in Section 4.3. We feed the models with features based on the first x days, and learn from the training set the total mention count as a function of the features.

Table 5.1 The 8 machine learning models

Model ID  Machine Learning Model Template  The way of using features
1         KNN                              Not merged
2         Weighted KNN                     Not merged
3         KNN                              Merged
4         Weighted KNN                     Merged
5         Linear regression                Not merged
6         Linear regression                Merged
7         Bagged tree                      Not merged
8         Bagged tree                      Merged

5.1.1. K-nearest-neighbor with feature selection

Figure 5.1 shows the performance of the KNN models - the trend of relative error as a function of x - and Figure 5.2 shows details after the peak day (day 120). The optimal K for different x ranges from 1 to 6, mostly falling in [1, 3]. At the peak day, the relative error increases greatly, which does not make sense; perhaps the forecasting accuracy is very bad for topics with small mention counts, resulting in a large percentage error. After the peak day, the relative error drops greatly to 10% ~ 20%, which makes sense because typically the non-cumulative mention count drops greatly after the peak day and the cumulative mention count quickly converges to the total mention count. So when x > 120, the model can learn the total mention count from merely the first feature f_1(x), the average mention count per interval. Comparing the 4 KNN variations, different variations outperform at different time stages; in other words, none of them dominates the others over the whole time horizon. Using merged features, or weighting neighbors by distance, does not effectively improve predictive accuracy.
Figure 5.1 Relative error of KNN variations

Figure 5.2 Details after peak day - relative error of KNN variations

5.1.2. Linear regression with feature selection

Figure 5.3 plots the relative error of the linear regression models, and Figure 5.4 shows details after the peak day. Compared to KNN (whose relative error is mostly between 0 and 100 over the whole time horizon), the linear regression models have smaller relative error, and the relative error curve is more stable, without any big fluctuation.

Figure 5.3 Relative error of linear regression models

Figure 5.4 Details after peak day - relative error of linear regression models

5.1.3. Bagged tree

Figure 5.5 plots the relative error of the bagged tree models, and Figure 5.6 shows details after the peak day. Compared to the KNN and LR models, the tree models have the smallest relative error over most of the time horizon, and the smallest fluctuation, probably because of bagging.

Figure 5.5 Relative error of bagged tree models

Figure 5.6 Details after peak day - relative error of bagged tree models

5.1.4.
Ensemble model

Different models outperform at different time stages (Table 5.2). To utilize the advantages of all models, and to mitigate error peaks and fluctuations, we apply an ensemble model that takes the average over the KNN average, LR average, and Tree average as its predicted total mention count, where
- KNN average forecast value = average forecast value over all 4 KNN variations
- LR average forecast value = average forecast value over the 2 LR models
- Tree average forecast value = average forecast value over the 2 tree models

Figure 5.9 compares the four types of models at the same scale, and shows that the ensemble model has smaller relative error than the KNN, LR, or tree models over almost the whole time horizon. However, the results of the ensemble model are still not good enough, especially before the peak day (day 120). At the peak day, the relative error shows that the forecast value is still 30 times the true value. This is because the peak day is usually triggered by news outside Twitter, and therefore it is very hard to predict before the peak day without information from outside Twitter.

Table 5.2 Best model at each time stage

Days        0~60  60~90  90~120     120~150  >150
Best model  LR    KNN    LR & Tree  KNN      LR

Figure 5.7 Relative error of Ensemble model

Figure 5.8 Details after peak day - relative error of ensemble model

Figure 5.9 Compare relative error of four types of models

5.2.
The Basic Bass model results

In this section, we focus on the predictive accuracy of the Bass model, with two candidates for the predicted total mention count, and three time aggregation levels.

5.2.1. m̂ or N̂(180) - which one is better in prediction?

In the Bass model, we fitted the model parameters m, p, and q with N(t) - t pairs for t <= x days. The true value of our dependent variable is N(180), the cumulative mention count within 180 days. In Section 4.4.3 we mentioned that with the fitted parameters m̂, p̂, q̂, we can predict the cumulative mention count in two ways, N̂(180) or m̂ (= N̂(∞)). Table 5.3 shows that the fitting error is smaller if we use N̂(180) as the predicted total mention count. For example, when data are at hourly granularity, and we use 100% of the data to fit the Bass model, the relative error between m̂ and the real total mention count N(180 days) is 8%, while the relative error between N̂(180 days) and N(180 days) is 2%. And the 2% fitting error in Table 5.3 at the aggregation level of 1hr is relatively small, showing the fundamental feasibility of applying the Bass model to describe the Twitter topic diffusion process.

Figure 5.10 shows that no matter what the time aggregation level is, and how many days of information are used to predict, N̂(180) is always a better prediction than m̂. One reason to explain the better performance of N̂(180) might be that for some topics, the cumulative mention count keeps growing and does not converge to a plateau within the 180-day horizon. Figure 5.11 shows such a topic - #YOLO (you only live once). m̂ is the total mention count when time goes to infinity, when nearly no one mentions the topic any more, which does not hold for a topic like #YOLO. So in the following part, we will use N̂(180) as the predicted total mention count within 180 days.
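The two candidate predictors can be compared on a synthetic topic. A sketch using nonlinear least squares via scipy.optimize.curve_fit; the parameter values below are assumptions for illustration, not fitted Twitter values:

```python
import numpy as np
from scipy.optimize import curve_fit

def bass_N(t, m, p, q):
    """Cumulative mention count under the Basic Bass model."""
    e = np.exp(-(p + q) * t)
    return m * (1 - e) / (1 + (q / p) * e)

# synthetic topic generated from assumed parameters, observed for 90 days
t_obs = np.arange(1.0, 91.0)
y_obs = bass_N(t_obs, 1e6, 0.002, 0.08)

(m_hat, p_hat, q_hat), _ = curve_fit(bass_N, t_obs, y_obs,
                                     p0=[5e5, 0.01, 0.1],
                                     bounds=([1e3, 1e-6, 1e-6],
                                             [1e9, 1.0, 1.0]))

pred_N180 = bass_N(180.0, m_hat, p_hat, q_hat)  # candidate 1: N_hat(180)
pred_m = m_hat                                  # candidate 2: m_hat = N_hat(inf)
```

Since N(t) < m for every finite t, the prediction N_hat(180) is always below m_hat; the gap is what separates the two columns of Table 5.3.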
Table 5.3 Fitting accuracy (relative error with 100% data) with m̂ or N̂(180)

Time aggregation level    N̂(180)    m̂
1hr                       2.0%      8.0%
1d                        5.4%      1922.9%
5d                        12.6%     8410.1%

[Figure 5.10 The Bass model predictive accuracy with m̂ or N̂(180), aggregated to 1hr, 1 day, and 5 days, as a function of the days known to the model]

[Figure 5.11 Cumulative mention count of Topic 8 - #YOLO]

5.2.2. Three time aggregation levels

We aggregated the cumulative mention count into 3 aggregation levels - every 1 hr, every 1 day, and every 5 days. Figure 5.12 and Figure 5.13 plot predictive accuracy as a function of x. We can see that the smallest aggregation level (1hr, solid line) has smaller relative error in prediction over the whole time horizon.

[Figure 5.12 Predictive accuracy with 3 aggregation levels]

[Figure 5.13 Details - predictive accuracy with 3 aggregation levels]

If we take a deeper look at the fitted curves with different aggregation levels, at different time stages, for certain topics (Figure 5.14 and Figure 5.15), we will find that a larger aggregation level in fact fits the cumulative mention count curve better. However, we only care about the cumulative mention count at Day 180, and a smaller aggregation level fits this point better and therefore leads to smaller predicting error. So in the following part, we will focus on the aggregation level of 1hr for the Basic Bass model.
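The three aggregation levels can be produced from raw tweet timestamps with pandas; a sketch with synthetic timestamps (the 30-minute spacing is an arbitrary assumption):

```python
import pandas as pd

# hypothetical per-tweet timestamps for one topic: one tweet every 30 minutes
rng = pd.date_range("2013-01-01", periods=5000, freq="30min")
tweets = pd.Series(1, index=rng)

# aggregate mention counts to the three granularities used in this chapter
hourly = tweets.resample("1h").sum()
daily = tweets.resample("1D").sum()
five_day = tweets.resample("5D").sum()

# the cumulative mention count at each level is what feeds the Bass model fit
cum_hourly = hourly.cumsum()
```

Whatever the bin width, the total mention count is unchanged; only the number of N(t) - t pairs available to the fitting procedure differs, which is why the 1hr level constrains the fit the most.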
[Figure 5.14 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 89 - iWatch)]

[Figure 5.15 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 203 - #ISS)]

5.3. Machine learning vs. the Bass model

Figure 5.16 and Figure 5.17 show the predictive accuracy of the machine learning ensemble model and the Basic Bass model with an aggregation level of 1 hr. The Bass model has the smaller relative error over almost the whole time horizon. Using only mention count, the Bass model is no worse than the machine learning models with extra features (e.g. % root tweets) over the whole time horizon. It means the Basic Bass model partially captures the underlying mechanism of the Twitter topic development process. The Twitter topic development process is fundamentally similar to new product diffusion, but with a very large p (impact from out-of-network), and is therefore hard to predict before the peak day (the 120th day). Note that although the Bass model is much better in prediction compared to the machine learning models, it still cannot predict well before the peak day without information outside Twitter, e.g. the news triggering the Twitter topic explosion.
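All of these comparisons use the relative error of the predicted 180-day total. Assuming the definition |prediction − truth| / truth from our evaluation setup, a minimal sketch (the forecast values are hypothetical):

```python
import numpy as np

def relative_error(pred, true):
    """Relative error between predicted and true total mention count."""
    return np.abs(np.asarray(pred, dtype=float) - true) / true

# hypothetical forecasts of a topic's 180-day total at successive time stages
true_total = 100_000
forecasts = [3_200_000, 900_000, 140_000, 102_000]  # e.g. Day 30/60/120/150
print(relative_error(forecasts, true_total))
# -> [31.0, 8.0, 0.4, 0.02]
```

A relative error of 30 therefore means the forecast is roughly 30 times the true value, which is the scale of the pre-peak errors discussed above.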
In addition, recall that in Section 2.3.3 we talked about the intuition that the prediction power of the Bass model is limited, especially when we fit the curve with only points before the jump (Day 120) of the S-curve, because different sigmoid functions look similar at an early stage (Figure 5.19). Our results on Twitter data confirmed this intuition. Figure 5.18 shows that even with the best aggregation level (1hr), the predicting error is still large before the peak day, and especially before Day 30.

[Figure 5.16 Machine learning vs. Basic Bass]

[Figure 5.17 Details - Machine learning vs. Basic Bass]

[Figure 5.18 Predictive accuracy of the Bass model (aggregation level is 1hr)]

[Figure 5.19 Different sigmoid functions look similar at an early stage: Sigmoid 1 = 1/(1+exp(-2x+10)), Sigmoid 2 = 2/(1+exp(-2x+10)), Sigmoid 3 = 1.5/(1+exp(-2x+14))]

5.4. Modeling on a subset of topics

In this section, we will model on

- All 220 topics
- Topics with cumulative mention count >= 10 before the peak (196 topics)
- Topics with cumulative mention count >= 1000 before the peak (158 topics)
- Topics in Cluster 1 (137 topics)

and see if there exists a subset of topics with better predictability.

5.4.1. Machine learning on Cluster 1 only

Figure 5.20 recalls the clustering result on the 220 topics with K-means when K = 2. Topics in Cluster 1 have mention count curves with more fluctuations before the peak day, and therefore should provide more information for our models to predict. And Table 5.4 shows that the number of topics and topic total mention count do not differ significantly between Cluster 1 and Cluster 2.
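Clustering topics by the shape of their normalized mention-count curves can be sketched with scikit-learn's KMeans. The two synthetic groups below (noisy-flat curves vs. single-spike curves) are stand-ins for the real per-5-day percentage curves, not our actual data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# each row is one topic's mention count per 5-day bin, normalized to
# percentages so that curve *shape*, not total volume, drives the clustering
flat = rng.random((10, 36))                    # fluctuation spread over time
spiky = np.zeros((10, 36)); spiky[:, 24] = 1.0 # nearly all volume at the peak
curves = np.vstack([flat, spiky])
curves = curves / curves.sum(axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(curves)
```

Normalizing each curve to sum to 1 mirrors the "percentage mention count every 5 days" representation of Figure 5.20, so topics with pre-peak fluctuation separate cleanly from purely peak-driven ones.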
Figure 5.21 and Figure 5.22 compare the predictive accuracy of the machine learning ensemble model on all 220 topics vs. on Cluster 1 only. From Day 60 to Day 120, the relative error on Cluster 1 is much smaller than on all 220 topics. It means mention count fluctuation before the peak day does offer more information and makes it easier to predict the total mention count.

[Figure 5.20 Percentage mention count every 5 days of topics in each cluster (K=2)]

Table 5.4 Compare topics in two clusters

Cluster ID                     #1           #2
Number of topics               137          83
Average total mention count    1.675*10^6   0.506*10^6

[Figure 5.21 Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only]

[Figure 5.22 Details after the peak day - Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only]

5.4.2. The Bass model on a subset of topics

From Table 5.5 and Figure 5.23, we know that at the aggregation level of 1hr, compared to the average relative error over all 220 topics, certain subsets of topics have relatively smaller predicting error over most of the time horizon. For example, applying the Bass model on topics with mention count before the peak >= 1000, the predicting error after Day 30 can be reduced by 1% ~ 12% compared to the average predicting error over all 220 topics.
Table 5.5 Relative error of the Bass model on a subset of topics (rows: days known to the model, from 15 to 180; columns: baseline of all 220 topics, count before peak >= 10, count before peak >= 1000, and Cluster 1 only; in every column the relative error falls from roughly 150%~170% at Day 15 to about 1%~2% at Day 180)

[Figure 5.23 How much relative error of the Bass model can be reduced by using a subset of topics (data are aggregated to 1 hr)]

However, even focusing on the subset of topics with better predictability, the predicting error before the peak day is still above 60%, which is too large. It again shows the importance of information outside Twitter.

6. Discussion on the Parameter Estimation Methods of the Bass Model

The Basic Bass model has three parameters: m, p, and q. The parameter estimation procedure is essential in this study, since it directly determines the predictive accuracy. In this section, we will review existing methods for estimating the Bass model parameters, and then evaluate two of these methods with our Twitter data.

6.1. Review of existing methods to estimate the Bass model parameters

Recall that the key equation of the Bass model implies that the number of new adopters within (t, t + dt) is determined by the number of adopters N(t), the market potential (the total possible adopters) m = N(∞), the innovation coefficient p, and the imitation coefficient q:
n(t) = dN(t)/dt = p·[m − N(t)] + (q/m)·N(t)·[m − N(t)]

Solving this differential equation with boundary condition N(0) = 0, we get the expression for the cumulative number of adopters:

N(t) = m·(1 − e^(−(p+q)t)) / (1 + (q/p)·e^(−(p+q)t))

Other variables, including the non-cumulative number of adopters n(t), the time of peak adoption t*, and the number of adopters at the peak time n(t*), can be derived as follows:

n(t) = dN(t)/dt = m·p·(p+q)²·e^(−(p+q)t) / [p + q·e^(−(p+q)t)]²

t* = (1/(p+q))·ln(q/p)

n(t*) = m·(p+q)² / (4q)

6.1.1. Evaluate estimation methods by fitting error and one-step-ahead predicting error

Mahajan et al. (1985) compared four procedures for estimating m, p, and q in the Basic Bass model when t is continuous: ordinary least squares (OLS), maximum likelihood estimation (MLE), nonlinear least squares (NLS), and algebraic estimation (AE). Their results showed that NLS is better than the other three in terms of both fitting accuracy and one-step-ahead predictive accuracy.

6.1.1.1. Ordinary Least Squares Estimation (OLS)

OLS was originally suggested by Bass (1969). This method assumes the time intervals are equal to unity (e.g. yearly data), and applies OLS to fit the discretized key equation of the Bass model:

n(k) = p·[m − N(k−1)] + (q/m)·N(k−1)·[m − N(k−1)] = a·N(k−1)² + b·N(k−1) + c

Apply OLS to get the optimal a, b, and c (minimizing the MSE of n(k)), and then solve for the optimal m, p, and q from

a = −q/m,  b = q − p,  c = p·m,  so that  a·m² + b·m + c = 0,  p = c/m,  q = −a·m

There are two sets of solutions to this equation system:

m₁ = (−b + √(b² − 4ac)) / (2a), p₁ = c/m₁, q₁ = −a·m₁
or
m₂ = (−b − √(b² − 4ac)) / (2a), p₂ = c/m₂, q₂ = −a·m₂

Theoretically, m, p, and q are all positive, and we assume p < q to ensure that the peak time is positive; therefore a = −q/m < 0, b = q − p > 0, and c = p·m > 0. So √(b² − 4ac) > b, and m₁ < 0, which means the first solution is counter-intuitive; we therefore use the second solution.

Advantages of OLS are:
- It is easy to implement.
- OLS is applicable to many diffusion models other than the Bass model.
Disadvantages are:
- In the presence of only a few data points, N(k−1)² and N(k−1) are likely to be multicollinear, and the parameter estimates are unstable or possess wrong signs.
- Since m, p, and q are nonlinear functions of a, b, and c, the standard errors of m, p, and q are not available.
- There is a time-interval bias, since n(k) overestimates the derivative of N(t) taken at t = k−1 before the peak time, and underestimates it after the peak time.

6.1.1.2. Maximum Likelihood Estimation (MLE)

Schmittlein and Mahajan (1982) suggested an MLE procedure. Let f(t) denote the probability density of adoption at time t, and the C.D.F. F(t) = ∫₀ᵗ f(τ)dτ. Then

F(t) = N(t)/m = (1 − e^(−(p+q)t)) / (1 + (q/p)·e^(−(p+q)t))

Let M denote the total population; then m is the part of the population that will finally adopt when time goes to infinity. Let c = m/M. F(t) can be treated as the cumulative adoption probability at time t, conditional on eventual adoption. Then the unconditional C.D.F. is

G(t) = c·F(t) = c·(1 − e^(−(p+q)t)) / (1 + (q/p)·e^(−(p+q)t))

Then the likelihood function for the observed histogram up to the T time intervals for which data are available can be written as

L(T) = [1 − G(T)]^(M−N(T)) · ∏_(i=1..T) [G(i) − G(i−1)]^(n(i))

Advantages of MLE include:
- The time-interval bias is eliminated, since MLE uses the appropriate aggregation of the continuous-time model over the time intervals represented by the data.
- Schmittlein and Mahajan (1982) provided formulae for the approximate standard errors of p, q, and m to examine the stability of the estimated parameters.

Disadvantages are:
- Estimation of m and the development of predictions require prior knowledge of M.
- The approximate standard errors consider only sampling error and ignore all other sources of error.

6.1.1.3. Nonlinear Least Squares Estimation (NLS)

NLS was designed to overcome the shortcomings of MLE. There are three variations of NLS, suggested by Srinivasan and Mason (1985), Jain and Rao (1985), and Mahajan et al.
(1985), respectively fitting the following three nonlinear expressions to get m, p, and q, where F(t) = (1 − e^(−(p+q)t)) / (1 + (q/p)·e^(−(p+q)t)):

(1) n(t_k) = N(t_k) − N(t_(k−1)) = m·[F(t_k) − F(t_(k−1))]

(2) n(t_k) = (m − N(t_(k−1))) · [F(t_k) − F(t_(k−1))] / [1 − F(t_(k−1))]

(3) N(t_k) = m·F(t_k)

Mahajan et al. (1985) suggested that NLS (2) is the best in terms of fitting error and one-step-ahead predicting error.

Advantages of NLS are:
- It eliminates the time-interval bias.
- The procedure accounts for all sources of error, thereby providing valid standard error estimates.

Disadvantages are:
- It can be slow to converge, or may not converge.
- It is sensitive to starting values, and might converge to a local minimum; starting values need to be obtained by other parameter estimation methods.

6.1.1.4. Algebraic Estimation (AE)

AE was proposed by Mahajan and Sharma (1985), and is used to provide a rough estimate or good starting values for MLE and NLS. It requires knowledge of the adoption peak - the peak time t*, the peak height n(t*), and the cumulative level N(t*). From the peak relations

n(t*) = m·(p+q)² / (4q)  and  N(t*) = m·(q−p) / (2q)

we can express p and q in terms of m:

p = n(t*)·(m − 2N(t*)) / (m − N(t*))²
q = n(t*)·m / (m − N(t*))²

Consequently,

t* = (m − N(t*)) / (2n(t*)) · ln(m / (m − 2N(t*)))

is used to find m numerically. Once m is known, p and q are known.

Advantages:
- It is conceptually and computationally easy.
- It can be used to provide good starting values for MLE and NLS.

Disadvantages:
- It does not provide standard errors for the estimated parameters.
- It has a time-interval bias.
- It is not applicable if n(t) has not yet peaked.

6.1.2. Discrete Bass model

In reality, time t is usually discretized. Some of the above methods can be accommodated to the situation with discrete time, such as MLE, NLS, and AE. However, when OLS estimates m, p, and q, it does not require any information about time (the scale of the time unit); therefore applying the estimated m, p, and q to the expression

N(t) = m·(1 − e^(−(p+q)t)) / (1 + (q/p)·e^(−(p+q)t))

is mathematically questionable.
Satoh (2001) provided an explicit formula for N(k) in the discrete Bass model. It is based on the Riccati equation (Hirota 1979),

du/dt = a + 2bu + cu²

The parameters a, b, and c can even be functions of time, a(t), b(t), c(t); in the case of the discrete Bass model, they are all constants. A discrete version of the Riccati equation is

(u(t+δ) − u(t−δ)) / (2δ) = a + b·(u(t+δ) + u(t−δ)) + c·u(t+δ)·u(t−δ)

where δ is the constant time interval. The solution to the discrete Riccati equation is

u(t) = (c₊ + c₋·exp(β(t−t₀))) / (1 + exp(β(t−t₀)))

where the constants c₊ and c₋ are the two fixed points of the Riccati equation, and β is determined by a, b, c, and δ.

Then the discrete Bass model can be derived from the continuous Bass model and the Riccati equation as follows:

(N(n+1) − N(n−1)) / (2δ) = p·[m − (N(n+1) + N(n−1))/2] + (q/m)·[m·(N(n+1) + N(n−1))/2 − N(n+1)·N(n−1)]

And the solution to the discrete Bass model is

N(n) = m·(1 − r^(n/2)) / (1 + (q/p)·r^(n/2)),  where r = (1 − δ(p+q)) / (1 + δ(p+q)) and n = t/δ

Note that the data have to be collected periodically, because the time interval δ is a constant. And when δ → 0, r^(t/(2δ)) → e^(−(p+q)t), so the discrete solution recovers the continuous one:

N(t) = m·(1 − e^(−(p+q)t)) / (1 + (q/p)·e^(−(p+q)t))

6.2. Parameter estimation procedure in this thesis

In our research with Twitter data, we applied NLS, the best method according to Mahajan et al. (1985). Although Mahajan suggested that fitting n(t_k) results in smaller fitting error and smaller predicting error than fitting N(t_k), we still fit N(t_k), because the metric we care about is different from the metric Mahajan used: we care about the cumulative mention count N(t_k) more than the non-cumulative mention count n(t_k), while Mahajan predicted n(t_k). So in this thesis, we fit the N(t) - t curve to find the optimal m, p, and q. For example, when we have data in the first x days, and we want to predict N(180), our fitting cost function is the sum of squared errors

fitting error = Σ_t (N̂(t) − N(t))²

over the observed time points t up to x days. Then, with the fitted m, p, and q, and the formula for N(t), we have N̂(180) as a prediction of N(180).
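Looping back to Section 6.1.2, the discrete solution can be checked numerically against its continuous limit. A sketch with assumed parameter values (the half-power convention r^(n/2) follows the solution form given above):

```python
import numpy as np

def discrete_bass_N(n, m, p, q, delta):
    """Closed-form solution of the discrete Bass model at step n (t = n*delta)."""
    r = (1 - delta * (p + q)) / (1 + delta * (p + q))
    s = r ** (n / 2)
    return m * (1 - s) / (1 + (q / p) * s)

def continuous_bass_N(t, m, p, q):
    """Cumulative adopters under the continuous Basic Bass model."""
    e = np.exp(-(p + q) * t)
    return m * (1 - e) / (1 + (q / p) * e)

# the discrete solution approaches the continuous one as delta -> 0
m, p, q, t = 1e6, 0.01, 0.3, 10.0
errors = [abs(discrete_bass_N(t / d, m, p, q, d) - continuous_bass_N(t, m, p, q))
          for d in (1.0, 0.1, 0.01)]
print(errors)  # shrinking gap as the time interval shrinks
```

The shrinking gap illustrates why, with finely aggregated data (e.g. the 1hr level), treating the discrete counts with the continuous formula introduces little additional error.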
6.2.1. Evaluate OLS and NLS with Twitter data

With Twitter data at the aggregation level of 1hr, we applied both OLS and NLS to predict the total mention count N(180); the results are shown in Figure 6.1. NLS outperforms OLS over almost the whole time horizon, which is consistent with the conclusion in Mahajan et al. (1985).

[Figure 6.1 Predicting error - OLS (dashed line) vs NLS (solid line)]

7. Conclusions and Summary

Much work has been done on applying diffusion models in social network analysis since Twitter introduced hashtags. Most of it tried to predict the user base of a whole social network website, such as Wang (2011) and Bauckhage et al. (2014). A few studies focused on a smaller granularity, hashtag popularity: e.g. Yang and Leskovec (2010) predicted it with a linear influence model, and Ma et al. (2013) predicted it by classification. However, to our knowledge, no one has applied the Bass model to predict topic popularity, where a topic is specified by keywords or a hashtag.

We applied machine learning regression models and the Basic Bass model to predict the total mention count of each Twitter topic. Different machine learning models - KNN, linear regression, bagged trees - outperform at different time stages. An ensemble model combines the advantages of all three models, and leads to better prediction over almost the whole time horizon, especially before the peak day.

Our results also reveal the fundamental feasibility of applying the Bass model to describe the Twitter topic diffusion process. Compared to the machine learning models, the Bass model dramatically reduces the prediction error before the peak day (Day 120). Using only mention count, over the whole time horizon, the Bass model is no worse than the machine learning models with extra features (e.g. % root tweets).
It means the Basic Bass model partially captures the underlying mechanism of the Twitter topic development process, and we can liken Twitter topics' adoption process to a new product's diffusion process.

There exists a subset of topics with a mention count pattern that is easier to predict than others. These topics usually have a higher mention count, or more fluctuation in the mention count curve before the peak day, and therefore offer more information to the model for predicting the total mention count at an early time stage.

However, even focusing on the subset of topics with better predictability, and applying the Bass model (better than the machine learning ensemble model), the predictive accuracy is still not good enough, especially before the peak day (Day 120). This is because the peak day is usually triggered by news outside Twitter; it is therefore very hard to predict before the peak day without information outside Twitter. Another reason for the predicting error at an early stage is that the Bass model assumes that the cumulative mention count follows a sigmoid function, and different sigmoid functions are very similar at an early stage.

8. Future Work

In the future, we could first try other possible models to predict hashtag popularity. For example, the Linear Influence Model (Yang and Leskovec 2010) is an alternative to the Bass model for modeling the influence of those who have already tweeted on those who have not, and for describing the diffusion process. The Logistic Growth Model (Chang and Wang 2011), the Weibull model, and the Gompertz model (Bauckhage et al. 2014) can also be applied to describe the diffusion process. In our case, since the predicted mention count will be used later to predict demand for a new product, classification might not be accurate enough. Or, if we can tolerate the inaccuracy of classification compared to regression, then we could use classification in our future research, which would be easier than regression.
The Generalized Bass model is also a choice, as an extension of our current Basic Bass model. It assumes the diffusion rate is affected by a real-time marketing influence denoted by x(t); x(t) can be price, advertising power, or any other factor at time t. The non-cumulative mention count can then be written as:

n(t) = [p + (q/m)·N(t)]·[m − N(t)]·x(t)

Second, there are other features that can be included in our machine learning models, such as some features used by Ma et al. (2013) - "Binary attribute checking whether a hashtag contains digits," "Number of segment words from a hashtag."

Third, there are other available data sources, such as news websites and blogs (Yang and Leskovec 2010). Sources outside Twitter are essential in our research, because there is a large percentage of exogenous trends, trends originating from outside of Twitter (Section 2.4.1; e.g., earthquakes, sports games). If we had knowledge of the trigger news outside Twitter, it would be very helpful in improving the predicting power of our models.

Fourth, there might be a way to incorporate the Bass diffusion model into machine learning models. During this study, we planned to predict m, p, and q based on features. We can treat m, p, and q as functions of features (e.g. % of root tweets), and learn the functions from the training set. However, in the test set, we would predict m, p, and q based solely on the features, not the mention count every hour. In this way, we are not utilizing the underlying diffusion structure captured by the Bass model, and therefore not taking advantage of the Bass model over machine learning models. So the prediction accuracy might be similar to the pure machine learning models we tried before. We hope to find a good way to combine the Bass model and machine learning in the future.

Finally, we can build a bridge between the Twitter mention count on a certain product and the demand for this product, and therefore make our models financially meaningful to companies.
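A quick way to see the role of x(t) is to integrate the Generalized Bass equation numerically. A minimal Euler-integration sketch with assumed parameter values; with x(t) = 1 it reduces to the Basic Bass model, and the constant ad push below is a hypothetical scenario:

```python
def simulate_gbm(m, p, q, x, T, dt=0.01):
    """Euler simulation of the Generalized Bass model
    dN/dt = (p + q*N/m) * (m - N) * x(t), where x is a callable
    marketing-effort term."""
    steps = int(T / dt)
    N = 0.0
    for i in range(steps):
        t = i * dt
        N += (p + q * N / m) * (m - N) * x(t) * dt
    return N

m, p, q = 1e6, 0.01, 0.3
basic = simulate_gbm(m, p, q, lambda t: 1.0, T=10)    # x(t) = 1: Basic Bass
boosted = simulate_gbm(m, p, q, lambda t: 1.5, T=10)  # hypothetical constant ad push
```

With a constant x(t) > 1 the diffusion simply runs faster, which is why estimating x(t) from price or advertising data is what would give the Generalized Bass model extra predictive power over the basic one.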
BIBLIOGRAPHY

1) Agarwal, Deepak, Bee-Chung Chen, and Pradheep Elango. "Spatio-temporal models for estimating click-through rate." Proceedings of the 18th International Conference on World Wide Web. ACM, 2009.
2) Archak, Nikolay, Anindya Ghose, and Panagiotis G. Ipeirotis. "Deriving the pricing power of product features by mining consumer reviews." Management Science 57.8 (2011): 1485-1509.
3) Bass, Frank. "A new product growth for model consumer durables." Management Science 15.5 (1969): 215-227.
4) Bass, Frank M., Trichy V. Krishnan, and Dipak C. Jain. "Why the Bass model fits without decision variables." Marketing Science 13.3 (1994): 203-223.
5) Bauckhage, Christian, Kristian Kersting, and Bashir Rastegarpanah. "Collective attention to social media evolves according to diffusion models." Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2014.
6) Çelen, Boğaçhan, and Shachar Kariv. "Observational learning under imperfect information." Games and Economic Behavior 47.1 (2004): 72-86.
7) Çelen, Boğaçhan, and Shachar Kariv. "An experimental test of observational learning under imperfect information." Economic Theory 26.3 (2005): 677-699.
8) Çelen, Boğaçhan, Shachar Kariv, and Andrew Schotter. "An experimental test of advice and social learning." Management Science 56.10 (2010): 1687-1701.
9) Chang, Hsia-Ching. "A new perspective on Twitter hashtag use: Diffusion of innovation theory." Proceedings of the American Society for Information Science and Technology 47.1 (2010): 1-4.
10) Chen, Yubo, and Jinhong Xie. "Online consumer review: Word of mouth as a new element of marketing communication mix." Management Science 54.3 (2008): 477-491.
11) Guadagno, Rosanna E., et al. "What makes a video go viral? An analysis of emotional contagion and Internet memes." Computers in Human Behavior 29.6 (2013): 2312-2319.
12) Hirota, Ryogo.
"Nonlinear partial difference equations. V. Nonlinear equations reducible to linear equations." Journal of the Physical Society of Japan 46.1 (1979): 312-319. 13) Jain, Dipak C., and Ram C. Rao. "Effect of price on the demand for durables: Modeling, estimation, and findings." Journal of Business & Economic Statistics8.2 (1990): 163170. 14) Ma, Zongyang, Aixin Sun, and Gao Cong. "On predicting the popularity of newly emerging hashtags in Twitter." Journal of the American Societyfor Information Science and Technology 64.7 (2013): 1399-1410. 91 15) Mahajan, V., C.H. Mason and V. Srinivasan: An evaluation of estimation procedures for new product diffusion models. In V. Mahajan and Y. Wind (eds.): Innovation Diffusion Models 0] New Product Acceptance (Ballinger Cambridge, Massachusetts, 1986), 203-232. 16) Mahajan, Vijay, and Subhash Sharma. "A simple algebraic estimation procedure for innovation diffusion models of new product acceptance."Technological Forecasting and Social Change 30.4 (1986): 331-345. 17) Naaman, Mor, Hila Becker, and Luis Gravano. "Hip and trendy: Characterizing emerging trends on Twitter." Journal of the American Society for Information Science and Technology 62.5 (2011): 902-918. 18) Netzer, Oded, et al. "Mine your own business: Market-structure surveillance through text mining." Marketing Science 31.3 (2012): 521-543. 19) Robinson, Bruce, and Chet Lakhani. "Dynamic price models for new-product planning." Management science 21.10 (1975): 1113-1122. 20) Rogers, E.M.. Diffusion of innovations 5th ed.. New York: Free Press (2003). 21) Satoh, Daisuke. "A discrete Bass model and its parameter estimation." Journal of the Operations Research Society of Japan-Keiei Kagaku 44.1 (2001): 1-18. 22) Schmittlein, David C., and Vijay Mahajan. "Maximum likelihood estimation for an innovation diffusion model of new product acceptance." Marketing sciencel.1 (1982): 57-78. 23) Schweidel, David A., and Wendy W. Moe. 
"Listening In on Social Media: A Joint Model of Sentiment and Venue Format Choice." Journal of Marketing Research 51.4 (2014): 387-402. 24) Srinivasan, V., and Charlotte H. Mason. "Technical Note-Nonlinear Least Squares Estimation of New Product Diffusion Models." Marketing science 5.2 (1986): 169-178. 25) Stouffer, Daniel B., R. Dean Malmgren, and Luis AN Amaral. "Log-normal statistics in e-mail communication patterns." arXiv preprint physics/0605027(2006). 26) Szabo, Gabor, and Bernardo A. Huberman. "Predicting the popularity of online content." Communications of the ACM 53.8 (2010): 80-88. 27) Tirunillai, Seshadri, and Gerard J. Tellis. "Does chatter really matter? Dynamics of usergenerated content and stock performance." Marketing Science 31.2 (2012): 198-215. 28) Ulrich, Rolf, and Jeff Miller. "Information processing models generating lognormally distributed reaction times." Journal of Mathematical Psychology37.4 (1993): 513-525. 29) Van Breukelen, Gerard JP. "Psychometric and information processing properties of selected response time models." Psychometrika 60.1 (1995): 95-113. 30) Wallsten, Kevin. ""Yes we can": How online viewership, blog discussion, campaign statements, and mainstream media coverage produced a viral video phenomenon." Journal of Information Technology & Politics 7.2-3 (2010): 163-181. 31) Wang, Chen-Ya. "A PRELIMINARY FORECASTING WITH DIFFUSION MODELS: TWITTER ADOPTION AND HASHTAGS DIFFUSION." (2011) 92 32) Yang, Jaewon, and Jure Leskovec. "Modeling information diffusion in implicit networks." Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010. 33) Zaman, Tauhid, Emily B. Fox, and Eric T. Bradlow. "A Bayesian approach for predicting the popularity of tweets." The Annals of Applied Statistics 8.3 (2014): 1583-1611. 
Appendix A The List of Topics Used in this Research

Columns: Topic ID, Topic Name (denoted by a hashtag or keywords), Peak Time, Total mention count, Cluster ID

1 2Chainz 07-May-2012 1250460 1
2 #AintNobodyGotTimeForThat 11-Jul-2012 294508
3 #ratchet 07-Nov-2012 483155 2 1
4 #ChiefKeefMakesMusicFor 17-Sep-2012 7873
5 #coolstorybro 18-Mar-2012 66080
6 #Struggle 07-Nov-2012 120256
7 #TurntUp Oct-2012 331318 2
8 #yolo 2E-Mar-2012 5190369
9 #ThatShitIDontLike 0!-Sep-2012 66505
10 #hirihanna 1;-Oct-2011 11601
11 #ModernSeinfeld 12-Dec-2012 771 2 1
12 #DrakesMusicWillHaveYou 17-Dec-2012 55363 2
13 #endoftheworldconfessions 21-Dec-2012 581195 2
14 #2012regrets 27-Dec-2012 140775 2
15 #MSL OE-Aug-2012 383626 2
16 #EDL 01-Sep-2012 117272 1
17 #Sandy Oct-2012 4600864 1
18 #ISS Oct-2012 104387 1
19 #Synchro 05-Aug-2012 4324 1
20 #Swimming 29-Jul-2012 265618 1
21 #Endeavour 21 0-Sep-2012 53826 1
22 #spottheshuttle 3C-Sep-2012 73274
23 #austin 1E-Nov-2012 220137 2 1
24 #Bengals Jan-2012 141095 1
25 #WhoDey Jan-2012 62526 1
26 #pandaAl Apr-2012 1167 2
27 #askneil 24-Oct-2012 5726 2
28 #bondtweets 23-Oct-2012 654
29 #london2012 27-Jul-2012 5212299 2 1
30 #oneweb 2-)-Jul-2012 9153 2
31 #openingceremony 27-Jul-2012 678721 2
32 #blur 02-Jul-2012 39524
33 #RoyalBaby 03-Dec-2012 140523
34 #KONY2012 07-Mar-2012 2305934
35 #StopKony 07-Mar-2012 2094112
36 #TwitternWie1989 09-Nov-2012 3358
37 #JournalistBerlin 09-Nov-2012 286
38 #Houla 27-May-2012 30123
39 #Damascus 18-Jul-2012 316768
40 #Syria 18-Jul-2012 7625671
41 #AskPele 27-Jun-2012 1379
42 #WorldCup 11-Sep-2012 63736
43 #PrayForMuamba 17-Mar-2012 441350
44 #CFC 28-Oct-2012 2846990
45 #VMARedCarpet 06-Sep-2012 67648
46 #nfltotalaccess 05-Feb-2012 6917
47 #summerwars 20-Jul-2012 97225
48 #ntv 20-Jul-2012 705685
49 Obama 07-Nov-2012 51927099
50 Gulf Oil Spill 30-Apr-2010 450756
51 Haiti Earthquake 13-Jan-2010 429935
52 Pakistan Floods 27-Aug-2010 44576
53 Koreas Conflict 23-Nov-2010 394
54 Chilean Miners Rescue 13-Oct-2010 38100
55 Chavez Tas Ponchao 25-Jan-2010 9222
56
Wikileaks Cablegate 10-Dec-2010 14081 57 Hurricane Earl 31-Aug-2010 149641 58 Prince Williams Engagement 16-Nov-2010 4290 59 World Aids Day 01-Dec-2010 153741 60 Apple iPad 27-Jan-2010 852505 61 Google Android 20-May-2010 366282 62 Apple iOS 22-Nov-2010 227047 63 Apple iPhone 07-Jun-2010 1645107 64 Call of Duty Black Ops 09-Nov-2010 430875 65 New Twitter 28-Sep-2010 2485215 96 HTC 15-Sep-2010 1250908 RockMelt 08-Nov-2010 120515 MacBook Air 20-Oct-2010 601737 Google Instant 08-Sep-2010 242225 #rememberwhen 22-Nov-2010 346771 #slapyourself 17-Nov-2010 274294 #confessiontime 07-Nov-2010 312491 #thingsimiss 05-Dec-2010 293510 #ohjustlikeme 20-Nov-2010 49351 #wheniwaslittle 10-Aug-2010 739665 #haveuever 17-Nov-2010 111262 #icantlivewithout 04-Nov-2010 157055 #thankful 25-Nov-2010 340865 #2010disappointments 02-Dec-2010 195233 toyota and recall 29-Jan-2010 9173 GM and switch 30-Mar-2014 2007 Infantino and baby sling 24-Mar-2010 46 #myNYPD 22-Apr-2014 142702 Boston marathon 15-Apr-2013 2470570 SONY and Korea 24-Dec-2014 38577 burger king and lettuce 19-Jul-2012 391 horsemeat 15-Jan-2013 339336 #McDStories 18-Jan-2012 25178 iwatch 09-Sep-2014 872266 Amazon Fire Phone 18-Jun-2014 283652 XBox One 02-Dec-2014 4877345 Kraft belVita 11-Nov-2011 189 Ebola 02-Oct-2014 37062214 Senator proposal 08-Mar-2013 2598 Japan Secrecy Bill Law 06-Dec-2013 480 Virginia AG recount 26-Nov-2013 1313 New Pontifex 13-Mar-2013 1291 IPL tournament 21-May-2013 6246 Castle in the Sky 25-Aug-2013 13460 97 100 Northern India flooding 21-Jun-2013 1688 101 Brazilian protests 22-Jun-2013 15396 102 Vine resume 21-Feb-2013 2869 103 Primetime Emmys 23-Sep-2013 8679 104 DOMA Prop8 26-Jun-2013 4843 105 Sochi Olympics 18-Dec-2013 686932 106 Salvage Costa Concordia 16-Sep-2013 20988 107 Italy election 26-Feb-2013 43998 108 World Youth Day 28-Jul-2013 35591 109 #hochwasser 03-Jun-2013 154423 110 Australian election 07-Sep-2013 40858 111 Kobe Cuban 24-Feb-2013 11516 112 German election 22-Sep-2013 32456 113 
OneDirection 26-Aug-2013 544327 114 #aufschrei 25-Jan-2013 108686 115 Fashion Week 15-Feb-2013 1097929 116 Asiana 214 06-Jul-2013 55804 117 anniversary 50th Washington 118 #IranTalks 24-Nov-2013 47603 119 #NobelPeacePrize 11-Oct-2013 26398 120 France World Cup 19-Nov-2013 71267 121 Dilma 06-Sep-2013 3217901 122 typhoon Philippines 09-Nov-2013 679418 123 Red panda 24-Jun-2013 88255 124 #Troon 17-Apr-2013 169696 125 Australian Open 27-Jan-2013 367373 126 Tour de France 21-Jul-2013 687379 127 Academy Awards 25-Feb-2013 293091 128 #RockInRio 16-Sep-2013 460624 129 MTV VMAs 26-Aug-2013 254083 130 #Ashes 17-Dec-2013 1153050 131 #MalalaDay 12-Jul-2013 92522 132 #MariagePourTous 23-Apr-2013 898580 [ arch 28-Aug-2013 98 59646 133 March Madness 21-Mar-2013 1163560 134 Thatcher death 08-Apr-2013 157218 135 jason collins gay 29-Apr-2013 259639 136 #stanleycup 25-Jun-2013 546988 137 #Inauguration 21-Jan-2013 170746 138 #DoctorWho 23-Nov-2013 1828363 139 #ThankYouSachin 15-Nov-2013 1116119 140 #SFBatkid 15-Nov-2013 285846 141 #RIPMandela 05-Dec-2013 159670 142 World Cup Draw 06-Dec-2013 255758 143 #ThankYouSirAlex 08-May-2013 678244 144 #StandWithRand 07-Mar-2013 489851 145 #2013MAMA 22-Nov-2013 545670 146 #UCLFinal 25-May-2013 437898 147 #PLL 28-Aug-2013 2495315 148 #Sharknado 12-Jul-2013 547256 149 Government shutdown 01-Oct-2013 1705491 150 #StandWithWendy 26-Jun-2013 495244 151 #SB47 04-Feb-2013 568084 152 #NBAFinals 21-Jun-2013 1523961 153 Wimbledon 07-Jul-2013 2015368 154 Eurovision 18-May-2013 1693285 155 #RoyalBaby 22-Jul-2013 1592166 156 #NewYearsEve 01-Jan-2014 224751 157 #IPL 01-Jun-2014 374769 158 #Carnaval 01-Mar-2014 477170 159 #RIPPhilipSeymourHoffman 02-Feb-2014 50326 160 #SuperBowl 03-Feb-2014 2755775 161 #Oscars 03-Mar-2014 4251433 162 #UmbrellaRevolution 03-Oct-2014 277261 163 #iVoted 04-Nov-2014 28120 164 #Abdicates 05-Jun-2014 274 165 #lndiaVotes 05-Mar-2014 8172 166 #wt20 06-Apr-2014 695129 99 167 #Alia 06-Jan-2015 5528 168 #Wimbledon 06-Jul-2014 907759 169 
169 | #USOpen | 06-Sep-2014 | 819176 | -
170 | #Sochi2014 | 07-Feb-2014 | 5069375 | -
171 | #BCSChampionship | 07-Jan-2014 | 273487 | -
172 | #WorldCup | 08-Jul-2014 | 20084808 | -
173 | #FrenchOpen | 08-Jun-2014 | 115527 | -
174 | #Coachella | 09-Jan-2014 | 76383 | -
175 | #TDF14 | 09-Jul-2014 | 18734 | -
176 | #BerlinWall | 09-Nov-2014 | 77477 | -
177 | #NYFW | 09-Sep-2014 | 653920 | -
178 | #BringBackOurGirls | 10-May-2014 | 3868601 | -
179 | #MalalaYousafzai | 10-Oct-2014 | 152659 | -
180 | #ThanksLD | 10-Oct-2014 | 103118 | -
181 | #Spain2014 | 10-Sep-2014 | 953698 | -
182 | #RIPRobinWilliams | 11-Aug-2014 | 2757478 | -
183 | #ComingHome | 11-Jul-2014 | 29180 | -
184 | #CometLanding | 12-Nov-2014 | 570908 | -
185 | #GoldenGlobes | 13-Jan-2014 | 1618349 | -
186 | #GermanyWins | 13-Jul-2014 | 1468 | -
187 | #Ferguson | 14-Aug-2014 | 9921883 | -
188 | #TheVoiceAU | 14-Jul-2014 | 221374 | -
189 | #StanleyCup | 14-Jun-2014 | 490531 | -
190 | #NBAFinals | 16-Jun-2014 | 1333518 | -
191 | #Formula1 | 16-Mar-2014 | 273901 | -
192 | #NBAAllStar | 17-Feb-2014 | 375858 | -
193 | #MH17 | 17-Jul-2014 | 4044279 | -
194 | #OnlyOnTwitter | 18-Feb-2015 | 2707 | -
195 | #IceBucketChallenge | 19-Aug-2014 | 22373 | -
196 | #BRITAwards | 19-Feb-2014 | 45761 | -
197 | #LoveTheatre | 19-Nov-2014 | 46692 | -
198 | #IndyRef | 19-Sep-2014 | 5440172 | -
199 | #NFLPlayoffs | 20-Jan-2014 | 586196 | -
200 | #FirstTweet | 20-Mar-2014 | 357941 | -
201 | #MarchMadness | 21-Mar-2014 | 1574794 | -
202 | #Glasgow2014 | 23-Jul-2014 | 811096 | -
203 | #ISS | 23-Nov-2014 | 492621 | -
204 | #MuseumWeek | 24-Mar-2014 | 203938 | -
205 | #MH370 | 24-Mar-2014 | 4837319 | -
206 | #Cannes2014 | 24-May-2014 | 720491 | -
207 | #MarsOrbiter | 24-Sep-2014 | 16032 | -
208 | #VMAs | 25-Aug-2014 | 3082229 | -
209 | #HeForShe | 25-Sep-2014 | 812374 | -
210 | #PhotoshopRF | 25-Sep-2014 | 11479 | -
211 | #Emmys | 26-Aug-2014 | 863563 | -
212 | #AusOpen | 26-Jan-2014 | 804617 | -
213 | #Eleições2014 | 26-Oct-2014 | 411975 | -
214 | #DerekJeter | 26-Sep-2014 | 149596 | -
215 | #Grammys | 27-Jan-2014 | 3454718 | -
216 | #RIPMayaAngelou | 28-May-2014 | 2705 | -
217 | #PutOutYourBats | 28-Nov-2014 | 147452 | -
218 | #ModiInAmerica | 28-Sep-2014 | 93461 | -
219 | #SOTU | 29-Jan-2014 | 1391601 | -
220 | #WorldSeries | 30-Oct-2014 | 1137844 | -

Appendix B
Predicted Bass Model Curves of Eight Sampled Topics with Only Partial Data Known to the Model
[Figures: Bass model fits for eight sampled topics: Event 1 (2Chainz), Event 2 (#AintNobodyGotTimeForThat), Event 5 (#coolstorybro), Event 20 (#Swimming), Event 56 (Wikileaks Cablegate), Event 157 (#IPL), Event 213 (#Eleições2014), and Event 214 (#DerekJeter). Each topic is shown in four panels, corresponding to 90, 120, 150, and 180 days of data known to the model. Each panel plots cumulative mention count against day (0 to 200), comparing the true curve with the Bass model predictions labeled 1hr, 1d, and 5d.]
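The curves in this appendix come from fitting the basic Bass model to the known portion of a topic's cumulative mention-count series and extrapolating to the 180-day horizon, with parameters estimated by nonlinear least squares in the spirit of Srinivasan and Mason (1986). The sketch below illustrates the idea in Python; it assumes SciPy is available, and the synthetic series and parameter values are purely illustrative, not taken from the thesis data:

```python
import numpy as np
from scipy.optimize import curve_fit

def bass_cumulative(t, m, p, q):
    """Cumulative adoptions N(t) under the basic Bass model:
    N(t) = m * (1 - exp(-(p+q)t)) / (1 + (q/p) * exp(-(p+q)t)),
    where m is the market potential (total mention count), p the
    coefficient of innovation, and q the coefficient of imitation."""
    e = np.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

# Synthetic daily cumulative counts for the first 90 days of a topic
# (illustrative "true" parameters, not estimates from the thesis).
t_known = np.arange(1.0, 91.0)
m_true, p_true, q_true = 1e5, 0.01, 0.12
observed = bass_cumulative(t_known, m_true, p_true, q_true)

# Nonlinear least squares on the partial series, then extrapolation
# of the fitted curve to the full 180-day horizon.
(m_hat, p_hat, q_hat), _ = curve_fit(
    bass_cumulative, t_known, observed,
    p0=(2.0 * observed[-1], 0.03, 0.3),
    bounds=([1.0, 1e-6, 1e-6], [1e9, 1.0, 1.0]),
)
forecast_total = bass_cumulative(180.0, m_hat, p_hat, q_hat)
```

Since N(t) approaches m as t grows, the fitted m_hat itself serves as the forecast of the topic's eventual total mention count once the curve has saturated within the horizon.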