
Forecasting Twitter Topic Popularity Using Bass Diffusion
Model and Machine Learning
by
Yingzhen Shen
Bachelor of Science in Information Science and Technology
Tsinghua University, Beijing, China, 2013
SUBMITTED TO THE DEPARTMENT OF CIVIL AND ENVIRONMENTAL ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN TRANSPORTATION
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
MAY 2015
© 2015 Yingzhen Shen. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly
paper and electronic copies of this thesis document in whole or in part in any medium
now known or hereafter created.
Signature of Author: Signature redacted
Department of Civil and Environmental Engineering
May 18, 2015

Certified by: Signature redacted
David Simchi-Levi
Professor of Civil and Environmental Engineering
Thesis Supervisor

Accepted by: Signature redacted
Heidi Nepf
Donald and Martha Harleman Professor of Civil and Environmental Engineering
Chair, Graduate Program Committee
Forecasting Twitter Topic Popularity Using Bass Diffusion
Model and Machine Learning
by
Yingzhen Shen
Submitted to the Department of Civil and Environmental Engineering
in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Transportation
ABSTRACT
Today social network websites like Twitter are important information sources for a
company's marketing, logistics and supply chain. Sometimes a topic about a product will
"explode" at a "peak day," suddenly being talked about by a large number of users.
Predicting the diffusion process of a Twitter topic is meaningful for a company to forecast
demand, and plan ahead to dispatch its products.
In this study, we collected Twitter data on 220 topics, covering a wide range of fields. And
we created 12 features for each topic at each time stage, e.g. number of tweets
mentioning this topic per hour, number of followers of users already mentioning this topic,
and percentage of root tweets among all tweets. The task in this study is to predict the
total mention count within the whole time horizon, 180 days, as early and accurately as
possible. To complete this task, we applied two models - fitting the curve denoting topic
popularity (mention count curve) by Bass diffusion model; and using machine learning
models including K-nearest-neighbor, linear regression, bagged tree, and ensemble to
learn the topic popularity as a function of the features we created.
The results of this study reveal that the Basic Bass model captures the underlying
mechanism of the Twitter topic development process, and that the adoption of a Twitter
topic can be treated as analogous to the diffusion of a new product. Using only mention
count over the whole time horizon, the Bass model has much better predictive accuracy
than machine learning models with extra features. However, even with the best model
(the Bass model), and even when focusing on the subset of topics with better
predictability, predictive accuracy is still not good enough before the "explosion day."
This is because the "explosion" is usually triggered by news outside Twitter, and is
therefore hard to predict without information from outside Twitter.
Thesis Supervisor: David Simchi-Levi
Title: Professor of Civil and Environmental Engineering
BIOGRAPHICAL NOTES
The author graduated in 2009 from Tsinghua University with a Bachelor's degree in
Information Science and Technology. She has research experience in the field of
transportation choice behavior and the airline industry, and has industrial working
experience in the field of financial data mining. She is now a research assistant of Prof.
David Simchi-Levi in the MIT Department of Civil and Environmental Engineering. Her
research interests include social network big data, machine learning, and optimization.
ACKNOWLEDGEMENTS
I wish to take this opportunity to thank the professors, lab mates, classmates, and friends
who helped me during my two years in the Department of Civil and Environmental
Engineering at MIT. I am fortunate to have studied and done research at MIT with a group
of excellent and friendly people.

Deep appreciation goes to my advisor, Prof. David Simchi-Levi. His understanding, trust,
encouragement, and expertise have always motivated me to learn more and dig deeper in
my research. I believe the experience of working in his group will influence my future
positively and profoundly.
I would also like to show gratitude to my co-worker, Xiaoyu Zhang, who provided the idea
of applying the Bass model, and enormous help during the whole research process.
CONTENTS

ABSTRACT
BIOGRAPHICAL NOTES
ACKNOWLEDGEMENTS
CONTENTS
List of Tables
List of Figures
1. Introduction
2. Literature Review
   2.1. Online word of mouth and marketing
   2.2. Predicting the popularity of an individual tweet
   2.3. Bass diffusion models and extensions
      2.3.1. Diffusion/Adoption process
      2.3.2. Basic Bass Diffusion Model
      2.3.3. Use the Bass model to fit or predict
      2.3.4. Bass curve shape with different p and q
      2.3.5. The Generalized Bass model
   2.4. Integrating social networks into economic diffusion models
      2.4.1. Two types of Twitter topics with different predictability
      2.4.2. Predict Twitter hashtag popularity by classification
      2.4.3. Apply diffusion models to predict the user amount of social network websites
      2.4.4. Predict hashtag/keyword popularity when the popularity is defined continuously
3. Data Overview
   3.1. Definitions
   3.2. Data source - Topsy API
   3.3. Data size - number of topics and time horizon of each topic
   3.4. Raw data and its format
   3.5. Data processing before modeling
   3.6. Other data sources providing Twitter data
4. Models
   4.1. Dependent and independent variables in machine learning models
   4.2. Metrics used to evaluate models
   4.3. Machine learning regression models
      4.3.1. K-nearest-neighbor with feature selection
      4.3.2. Linear regression with feature selection
      4.3.3. Bagged regression tree
      4.3.4. Ensemble model
   4.4. The Basic Bass model
      4.4.1. Model formulation
      4.4.2. Three time aggregation levels - fitting error and forecasting error
      4.4.3. Define the predicted cumulative mention count
   4.5. Modeling on a subset of topics
      4.5.1. Determine the number of clusters
      4.5.2. Clustering results
      4.5.3. Filtering topics by the total mention count before peak day
   4.6. Notations
5. Results - Predicting Cumulative Mention Count
   5.1. Machine learning model results
      5.1.1. K-nearest-neighbor with feature selection
      5.1.2. Linear regression with feature selection
      5.1.3. Bagged tree
      5.1.4. Ensemble model
   5.2. The Basic Bass model results
      5.2.1. m or N180 - which one is better in prediction?
      5.2.2. Three time aggregation levels
   5.3. Machine learning vs. the Bass model
   5.4. Modeling on a subset of topics
      5.4.1. Machine learning on Cluster 1 only
      5.4.2. The Bass model on a subset of topics
6. Discussion on the Parameter Estimation Methods of the Bass Model
   6.1. Review of existing methods to estimate the Bass model parameters
      6.1.1. Evaluate estimation methods by fitting error and one-step-ahead predicting error
      6.1.2. Discrete Bass model
   6.2. Parameter estimation procedure in this thesis
   6.3. Evaluate OLS and NLS with Twitter data
7. Conclusions and Summary
8. Future Work
BIBLIOGRAPHY
Appendix A  The List of Topics Used in this Research
Appendix B  Predicted Bass Model Curves of 8 Sampled Topics with Only Partial Data Known to the Model
List of Tables

Table 2.1 Log-likelihood (LL) and Deviance information criterion (DIC) for a 100% observation fraction for the full Retweet Model and a nested Strawman Model - Zaman et al. (2014)
Table 2.2 Features used in Ma et al. (2013) - 7 content features (Fc) and 11 contextual features (Fx)
Table 2.3 Five Data Sources for Wang (2011)
Table 3.1 Topsy API
Table 3.2 Filter parameters of Topsy API
Table 3.3 Raw Data Part 1 (collected by hour) - Variables for each topic
Table 3.4 Raw Data Part 1 - Topic 89 (iWatch) as an Example
Table 3.5 Raw Data Part 2 - Sheet A: Count of influencers mentioning the topic
Table 3.6 Raw Data Part 2 - Sheet B: Average follower count of influencers mentioning the certain topic
Table 3.7 Raw Data Part 2 - Sheet C: Sum of the influential levels of all influencers
Table 4.1 Independent variables (features)
Table 4.2 Variations of KNN
Table 4.3 Variations of Linear Regression models
Table 4.4 Variations of Tree models
Table 4.5 Percentage mention count every 5 days (vector being clustered)
Table 4.6 Principal components of mention count vector
Table 4.7 Determine number of clusters
Table 4.8 Number of topics in each cluster (k-means)
Table 5.1 The 8 machine learning models
Table 5.2 Best model at each time stage
Table 5.3 Fitting accuracy (relative error with 100% data) with m or N180
Table 5.4 Compare topics in two clusters
Table 5.5 Relative error of the Bass model on a subset of topics
List of Figures

Figure 1.1 Twitter mention count of iWatch
Figure 2.1 Spread of One Sample Root Tweet - from Zaman et al. (2014)
Figure 2.2 Distribution of Reaction Time for 1 Root Tweet - Zaman et al. (2014)
Figure 2.3 Prediction of the total number of retweets for two root tweets - Zaman et al. (2014)
Figure 2.4 Median absolute percentage error (MAPE) of Zaman's Retweet Model and Benchmarks
Figure 2.5 MAPE of Retweet Model vs Benchmarks with no knowledge of the network structure model - Zaman et al. (2014)
Figure 2.6 Innovators and Imitators
Figure 2.7 Number of Innovators and Imitators over time
Figure 2.8 The shape of N(t) curve and n(t) curve
Figure 2.9 Actual sales and predicted sales for room air conditioners - Bass (1969)
Figure 2.10 Predict Color TV sales based on the first three points - Bass (1969)
Figure 2.11 Comparison of different sigmoid functions
Figure 2.12 Shape of n(t) curve - q>p on the left, q<p on the right
Figure 2.13 Shape of n(t) curve with different values of p (innovation) and q (contagion)
Figure 2.14 Precision and Recall (source: Wikipedia)
Figure 2.15 Twitter Cumulative Adoptions: 5 Years since Introduction - True (left) vs Fitted (right) - Wang (2011)
Figure 2.16 Predict future Google search frequency for "YouTube" (left) and "Twitter" (right) - Bauckhage et al. (2014)
Figure 2.17 The Linear Influence Model
Figure 2.18 Influence function across different fields - (a) Politics, (b) Nation, (c) Entertainment, (d) Business, (e) Technology, (f) Sports
Figure 3.1 Topsy Homepage
Figure 3.2 An Example of Using Topsy - Tweets on "iWatch" per day
Figure 3.3 Tweet Count on "iWatch" vs. "iPhone 6S"
Figure 3.4 Time Horizon and Mention Count of Topic 89 - iWatch
Figure 3.5 Keyhole's Results for "iWatch"
Figure 4.1 Distribution of the total mention count (220 topics)
Figure 4.2 True vs Fitted cumulative mention count curve
Figure 4.3 Compare fitted curves with aggregation levels of 1 hour, 1 day, and 5 days
Figure 4.4 Non-cumulative mention count of topics with/without fluctuation before peak day
Figure 4.5 Plot of the First 2 Principal Components
Figure 4.6 Plot of the First 3 Principal Components
Figure 4.7 Illustration of Elbow method
Figure 4.8 Elbow method when the "elbow" cannot be identified
Figure 4.9 Elbow method cost as a function of K, with our Twitter data
Figure 4.10 Information criteria with the first 6 principal components of mention count vector
Figure 4.11 Average Silhouette with our Twitter data
Figure 4.12 An example of Gap criterion
Figure 4.13 Gap values with our Twitter data (input reduced to 16 dimensions; clustered by Gaussian Mixture Model)
Figure 4.14 Percentage mention count every 5 days of topics in each cluster (K=2)
Figure 5.1 Relative error of KNN variations
Figure 5.2 Details after peak day - relative error of KNN variations
Figure 5.3 Relative error of linear regression models
Figure 5.4 Details after peak day - relative error of linear regression models
Figure 5.5 Relative error of bagged tree models
Figure 5.6 Details after peak day - relative error of bagged tree models
Figure 5.7 Relative error of Ensemble model
Figure 5.8 Details after peak day - relative error of ensemble model
Figure 5.9 Compare relative error of four types of models
Figure 5.10 The Bass model predictive accuracy with m or N180
Figure 5.11 Cumulative mention count of Topic 8 - #YOLO
Figure 5.12 Predictive accuracy with 3 aggregation levels
Figure 5.13 Details - predictive accuracy with 3 aggregation levels
Figure 5.14 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 89 - iWatch)
Figure 5.15 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 203 - #ISS)
Figure 5.16 Machine learning vs. Basic Bass
Figure 5.17 Details - Machine learning vs. Basic Bass
Figure 5.18 Predictive accuracy of the Bass model (aggregation level is 1 hour)
Figure 5.19 Different sigmoid functions look similar at early stage
Figure 5.20 Percentage mention count every 5 days of topics in each cluster (K=2)
Figure 5.21 Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only
Figure 5.22 Details after the peak day - Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only
Figure 5.23 How much relative error of the Bass model can be reduced by using a subset of topics (data aggregated to 1 hour)
Figure 6.1 Predicting error - OLS (dashed line) vs NLS (solid line)
1. Introduction
On social network websites such as Twitter, some topics will "explode" at a "peak day,"
suddenly being talked about by a large number of users. Some users hear about such a
topic from TV, newspapers, or news websites, while others learn about it from their
Twitter feeds, including tweets posted by their Twitter friends.
If the topic is about a product, then these tweets offer insights into the logistics and supply
chain of that product. For example, if we can predict whether or when the "explosion" will
happen, the company can plan ahead to increase production. Or, if the comments on the
topic are negative, as with "Toyota brake," the prediction of the "explosion" will help the
company detect the problem and recall the products earlier. Figure 1.1 shows the number
of tweets posted about the topic "iWatch," both cumulative and non-cumulative. The
peak day is Sep 9, 2014, the day Apple's CEO announced the iWatch.
[Figure 1.1 here: cumulative and non-cumulative mention count curves for "iWatch" over time; the peak day is 2014/9/9.]
Figure 1.1 Twitter mention count of iWatch
In this study, we use "diffusion" to denote the process by which a topic becomes known
to more and more users on Twitter, and "adoption" to denote the process by which users
tweet, retweet, or reply to a topic. We collected Twitter data on 220 topics, covering a
large variety of fields (Appendix A). For each topic, we collected its mention count on
Twitter from 120 days before the peak day to 60 days after the peak day. We also created
12 features for each topic at each time stage, e.g. the number of followers of users already
mentioning the topic, and the percentage of root tweets among all tweets.
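As a concrete illustration of the kind of per-stage features just described, the sketch below computes three of them from a handful of invented tweet records. This is a hedged sketch, not the thesis pipeline; the record layout and helper name are made up for illustration.

```python
from datetime import datetime, timedelta

# Invented minimal tweet records: (timestamp, is_root, author_followers)
tweets = [
    (datetime(2014, 9, 9, 10, 0), True, 1200),
    (datetime(2014, 9, 9, 10, 20), False, 300),
    (datetime(2014, 9, 9, 10, 45), True, 5000),
    (datetime(2014, 9, 9, 11, 30), False, 80),
]

def stage_features(tweets, stage_end, stage_hours=1):
    """Three example features for one time stage: mention count within
    the stage, cumulative follower reach so far, and %root tweets so far."""
    seen = [t for t in tweets if t[0] <= stage_end]
    stage_start = stage_end - timedelta(hours=stage_hours)
    mentions_this_stage = sum(1 for t in seen if t[0] > stage_start)
    follower_reach = sum(t[2] for t in seen)
    pct_root = sum(1 for t in seen if t[1]) / len(seen) if seen else 0.0
    return mentions_this_stage, follower_reach, pct_root

print(stage_features(tweets, datetime(2014, 9, 9, 11, 0)))  # (2, 6500, ~0.667)
```

In the thesis setting, a row of such features per topic per time stage becomes the input to the machine learning regression models.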
The task in this study is to predict the total mention count for every topic within the 180
days, as early and accurately as possible. To complete this task, we will apply the Bass
diffusion model to fit the mention count curve; and apply machine learning models
including K-nearest-neighbor, linear regression, bagged tree, and ensemble to learn the
total mention count as a function of the features we created.
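For intuition on the first approach, the Basic Bass cumulative adoption curve, the S-shaped curve later fitted to each topic's cumulative mention count, can be sketched in a few lines of Python. The parameter values below (m, p, q) are purely illustrative, not estimates from the thesis data.

```python
import math

def bass_cumulative(t, m, p, q):
    """Bass model cumulative adoptions N(t).
    m: market potential (total eventual adopters),
    p: coefficient of innovation, q: coefficient of imitation."""
    e = math.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

# Illustrative parameters; typically p << q when word of mouth dominates.
m, p, q = 1_000_000, 0.003, 0.25
curve = [bass_cumulative(t, m, p, q) for t in range(0, 181, 5)]

# The inflection point (the peak of the non-cumulative curve n(t))
# occurs at t* = ln(q/p) / (p + q) for the Basic Bass model.
t_star = math.log(q / p) / (p + q)
print(round(t_star, 1))  # ~17.5 with these parameters
```

With p > 0 the curve starts at N(0) = 0 and saturates at m; estimating m, p, and q from a partially observed mention count curve is exactly the fitting problem treated later in the thesis.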
To our knowledge, this study is the first to apply the Bass model to predict Twitter topic
popularity, where a topic is specified by keywords (e.g. "Jeremy Lin") or hashtag (e.g.
"#JeremyLin"). The results of this study reveal that the Basic Bass model captures the
underlying mechanism of the Twitter topic development process, and that Twitter topic
adoption can be treated as analogous to a new product's diffusion. Using only mention
count over the whole time horizon, the Bass model has much better predictive accuracy
than machine learning models with extra features (e.g. %root tweet).
However, even with the best model, the Basic Bass model, and focusing on the subset of
topics with better predictability, predictive accuracy is still not good enough before the
peak day (Day 120). This is because peak day is usually triggered by news outside Twitter,
and therefore is very hard to predict before the peak day without information outside
Twitter.
The rest of this thesis is organized as follows. In Section 2, we will review previous work
related to the diffusion process and online word of mouth (WoM), especially on Twitter.
Section 3 will give an overview of the Twitter data we collected and processed. In Section
4, we will describe how we model the Twitter data with machine learning models and the
Bass model. Model results will be discussed in Section 5. In Section 6, we will discuss the
parameter estimation procedures for the Bass model. Finally, Section 7 will draw
conclusions, and Section 8 will provide insights into future research directions.
2. Literature Review
In this section, we will review previous research on predicting the impact and popularity
of online word of mouth, especially by using economic diffusion models.
2.1. Online word of mouth and marketing
Individuals learn by observing the behavior of others and by taking advice from others.
This situation is called social learning. For example, people buy products based on what
their friends have bought, and on their friends' suggestions. Due to the wide use of the
Internet nowadays, connections between people exist not only in the real world, but also
online. Therefore, an important marketing channel today is online WoM, e.g. Amazon
product reviews, blogs, and social network websites.
A large body of literature investigates:
- How information (text, photo, or video) diffuses and spreads among a population
- What kind of information will go viral
- What role online WoM plays in marketing
- How to modify marketing strategy to utilize the business value of online WoM
To deeply understand the mechanism behind the social learning process, Çelen and Kariv
(2004) modeled the information diffusion process as a Bayes-rational sequential
decision-making process, where each decision maker observes only his/her predecessor's
binary action. Later, Çelen and Kariv (2005) and Çelen et al. (2010) designed experiments
to test this theory, both under the perfect-information condition, where individuals can
observe all the decisions previously made, and under the imperfect-information condition.
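The herding logic behind this line of work can be illustrated with a toy simulation. The sketch below is a simplified textbook-style cascade model (each agent sees all previous actions, not only the predecessor's, and follows a simple majority rule), so it illustrates the phenomenon rather than Çelen and Kariv's actual Bayesian model.

```python
import random

def simulate_cascade(true_state, signal_accuracy, n_agents, rng):
    """Toy sequential social learning: each agent receives a private
    binary signal (correct with probability signal_accuracy) and sees
    all previous agents' actions.  The agent follows the majority of
    (previous actions + own signal); ties go to the private signal."""
    actions = []
    for _ in range(n_agents):
        signal = true_state if rng.random() < signal_accuracy else 1 - true_state
        votes_for_1 = sum(actions) + signal
        votes_for_0 = len(actions) + 1 - votes_for_1
        if votes_for_1 != votes_for_0:
            actions.append(1 if votes_for_1 > votes_for_0 else 0)
        else:
            actions.append(signal)
    return actions

actions = simulate_cascade(true_state=1, signal_accuracy=0.7,
                           n_agents=50, rng=random.Random(42))
# Once the margin between the two actions reaches 2, every later agent
# herds on the majority action regardless of their own signal.
print(actions)
```

The cascade is absorbing: once the observed margin reaches two, no single private signal can flip a majority-follower, which is one reason early adopters can disproportionately shape what goes viral.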
Another type of research on the mechanism of online information diffusion was about
videos, predicting whether or not a video will go viral based on the features of the video
and the network structure. Wallsten (2010) studied the factors that lead viral videos to
spread across the Internet, assessing the relationships that drive viral videos by
examining the interplay between audience size, blog discussion, campaign statements,
and mainstream media coverage of the video.
of emotional response and video source on the likelihood of spreading an Internet video.
Their results indicated that individuals reporting strong affective responses to a video
reported greater intent to spread the video; and anger-producing videos were more likely
to be forwarded.
Archak et al. (2011) connected online WoM to marketing by incorporating Amazon
product review text in a consumer choice model by decomposing textual reviews into
segments describing different product features. They demonstrated how textual data can
be used to learn consumers' relative preferences for different product features, and also
how text can be used for predictive modeling of future changes in sales.
Netzer et al. (2012) found it possible to monitor market structure through text mining on
online user generated content (UGC), therefore converting the UGC to market structures
and competitive landscape insights.
Another way to get insights from online WoM is to derive brand sentiment from the
format of mini-blog posts (e.g. the length of a product review), as proposed by Schweidel
and Moe (2014).
Online WoM is so impactful that Tirunillai and Tellis (2012) discovered that even stock
market performance is significantly related to online UGC. Interestingly, the effects of
negative and positive UGC on abnormal stock returns are asymmetric: whereas negative
UGC has a significant negative effect on abnormal returns with a short "wear-in" and long
"wear-out," positive UGC has no significant effect on these metrics.
The wide impact and business power of online WoM motivated researchers to examine
when and how the seller should adjust its own marketing communication strategy in
response to consumer reviews. Chen and Xie (2008) discovered that offering consumer
review information too early leads to a lower profit. In addition, the optimal online
marketing strategy should depend on product characteristics, the informativeness of the
review, the seller's product assortment strategy, the seller's product value for the
partially matched consumers, and consumer heterogeneity in product consumption
expertise.
Starting from the next section, we will focus on text on social network websites, e.g.
Twitter or Facebook, and leave videos, photos, and online retailer product reviews aside.
2.2. Predicting the popularity of an individual tweet
In this section, we will focus on a fundamental and theoretical model to predict the spread
of an individual tweet, which was proposed by Zaman et al. (2014). When a root tweet,
which is an original tweet created and posted by a Twitter user, is posted on Twitter,
followers of this root user will see it on their personal Twitter pages, and might retweet
it. In this way, one root tweet is spreading in the Twitter network, and might be known to
a large number of users in the end. Zaman et al. (2014) predicted the number of retweets
when time goes to infinity, which is an evaluation of root tweet popularity.
Given a root tweet and the network structure, which is the "follow" relationship among
users, Zaman et al. (2014) built a Bayesian model to forecast the total number of retweets
that can be several hops from the root tweet.
The data they used were retweets from 52 root tweets (26 training and 26 testing), and
the data set is available on Zaman's homepage: http://www.zlisto.com/. Figure 2.1
shows the data for the root tweet "Cory Booker has never worked a day in his life. Not.
#corybookerstories" by root user "pbsgwen." The plot on the upper left shows the
number of retweets of the root tweet versus time. The table on the upper right shows
several users retweeting this tweet. The images of the retweet graph at different times
are shown below, from which we can see the spreading process of this root tweet in the
Twitter network.
Figure 2.1 Spread of One Sample Root Tweet - from Zaman et al. (2014)
Zaman made two assumptions about the spreading process of one root tweet:
1) Reaction time, the time interval between a retweet and its parent tweet, follows a
log-normal distribution. The distribution parameters are root-tweet-specific, differing
across root tweets. Figure 2.2 plots the reaction time distribution, based on the data
for one root tweet posted by user "KimKardashian." Log-normally distributed reaction
times have been observed in other application areas, e.g., the time for people to
respond to emails (Stouffer, Malmgren and Amaral, 2006) and call durations in call
centers (Brown et al., 2005); psychological and more fundamental explanations can be
found in Ulrich and Miller (1993) and Van Breukelen (1995).
2) The decision to retweet or not follows a binomial distribution, and the distribution
parameters are root-tweet- and user-specific.
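These two assumptions can be illustrated with a small simulation. The parameter values (mu, sigma, b) and the number of exposed followers below are arbitrary, chosen only to make the sketch concrete, not estimates from Zaman's data:

```python
import random

random.seed(42)

# Assumption 1: reaction times are log-normal with tweet-specific (mu, sigma).
# Assumption 2: each exposed follower retweets with a tweet- and
# user-specific probability b (a Bernoulli decision per follower).
mu, sigma, b = 3.0, 1.0, 0.05
exposed_followers = 2000

reaction_times = []
for _ in range(exposed_followers):
    if random.random() < b:                          # retweet decision
        reaction_times.append(random.lognormvariate(mu, sigma))
```

With these illustrative values, roughly b * 2000 = 100 followers retweet, each after a strictly positive, heavy-tailed delay.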
Figure 2.2 Distribution of Reaction Time for 1 Root Tweet ("KimKardashian," 768
retweets; reaction time in minutes, with fitted log-normal distribution) - Zaman et al.
(2014)
Figure 2.3 shows part of the results, comparing the true and predicted value of the total
number of retweets for two root tweets. The solid line is the number of observed
retweets versus time. The error bars correspond to the 90% credible intervals of the
predictive distribution for the total number of retweets based on observations only up to
that time point. And the point in the middle of error bars is the posterior median of the
predictive distribution for the total number of retweets.
Figure 2.3 Prediction of the total number of retweets for two root tweets ("TheRock,"
1260 retweets; "KimKardashian," 768 retweets; time in minutes) - Zaman et al. (2014)
Zaman then compared his Retweet Model to three benchmarks (Figure 2.4). The metric
for evaluating the models is the median absolute percentage error (MAPE), the median
of the absolute percentage prediction errors over the 26 testing samples. MAPE drops
as the observation fraction increases, which makes sense. The three benchmarks are
listed below:
1) Linear regression model
log(M_x) = β_0 + β_1 log(f_0^x) + ε_x
where M_x is the final total number of retweets of root tweet x, f_0^x is the follower
count of the root user, and ε_x is a zero-mean, normally distributed error term. This
model uses no temporal information, only the follower count of the root user, and has
an MAPE of 65%.
2) Regression model (Szabo and Huberman, 2010)
log(M_x) = β(t) + log(m_x(t)) + ε_x
which uses only the current retweet count m_x(t) of root tweet x.
3) Dynamic Poisson model with exponentially decaying rate (Agarwal, Chen and Elango,
2009). This model bins time into 5-minute intervals indexed by k, and the number of
retweets in the kth bin is a Poisson random variable whose rate decays exponentially
in k.
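Benchmark 1 is straightforward to reproduce: fit β_0 and β_1 by ordinary least squares on log-transformed data. The sketch below uses synthetic follower counts and retweet totals (not Zaman's data) only to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (for illustration only): follower counts of root
# users, and final retweet totals that roughly follow a noisy power law.
followers = rng.integers(100, 100_000, size=200).astype(float)
final_retweets = 0.05 * followers ** 0.8 * np.exp(rng.normal(0, 0.3, 200))

# Benchmark 1: log(M_x) = beta0 + beta1 * log(f_0^x) + eps
X = np.column_stack([np.ones_like(followers), np.log(followers)])
y = np.log(final_retweets)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_final_retweets(follower_count):
    """Point prediction of the final retweet count from follower count."""
    return np.exp(beta[0] + beta[1] * np.log(follower_count))
```

Since the synthetic data were generated with exponent 0.8, the fitted slope beta[1] recovers a value close to 0.8.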
Figure 2.4 shows that Zaman's Retweet Model outperforms all three benchmarks at every
observation fraction. The intuitive reason is that it uses both temporal information
and network structure information.
Figure 2.4 Median absolute percentage error (MAPE) of Zaman's Retweet Model
and benchmarks (Dynamic Poisson Model, Regression Model, and Retweet Model) versus
observation fraction
To demonstrate the impact of the network structure information, Zaman compared
his Retweet Model to two other benchmark models with no knowledge of the network
structure. One is a naive model that always predicts 1.4 m_x(t_x) as the total retweet
count. The other is a Strawman Model that ignores the follower counts and retweet
probabilities and assumes that the total retweet count comes from a Poisson
distribution (not binomial as before, since the follower counts are unknown) with a
global rate λ. Zaman's Retweet Model, in contrast, assumes that each of the f_0^x
followers of the root user independently retweets with probability b_x, so the
number of one-hop retweets M_1^x follows the binomial distribution Bin(f_0^x, b_x). Results
comparing the models are shown in Figure 2.5 and Table 2.1.
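The contrast between the two likelihoods can be sketched numerically. The follower count, observed retweet count, and global Poisson rate below are invented for illustration; the point is only that a binomial model with a tweet-specific probability typically assigns much higher likelihood than a Poisson with a single global rate, mirroring the LL gap reported in Table 2.1:

```python
import math

def binom_log_lik(k, n, b):
    """Log-likelihood of k one-hop retweets under Binomial(n, b)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(b) + (n - k) * math.log(1 - b))

def poisson_log_lik(k, rate):
    """Strawman: log-likelihood of k retweets under Poisson(rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

# Hypothetical root tweet: 5000 followers, 90 observed one-hop retweets.
f0, k = 5000, 90
ll_binom = binom_log_lik(k, f0, k / f0)   # b fit to this specific tweet
ll_poisson = poisson_log_lik(k, 40.0)     # one global rate, far from k
```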
Figure 2.5 MAPE of the Retweet Model vs. benchmarks with no knowledge of the
network structure (the naive 1.4 m_x(t_x) model and the Strawman Model) - Zaman et al.
(2014)
Table 2.1 Log-likelihood (LL) and Deviance information criterion (DIC) for a 100%
observation fraction for the full Retweet Model and a nested Strawman Model
- Zaman et al. (2014)

Model             LL          DIC
Retweet model     -38,860     83,848
Strawman model    -103,907    208,026
Zaman's model works in predicting the popularity of a single tweet, which is a smaller
granularity compared to our task for this thesis, predicting the popularity of a topic. Can
we generalize Zaman's one tweet level model to a topic level model?
One natural idea is that, if we already know how one root tweet spreads, our task
boils down to predicting the spread of all root tweets on a certain topic. However,
the difficulty is that there is a large number of root tweets posted by many root users,
and there is large overlap among the followers of these root users. To complicate
matters further, when a follower sees a root tweet, he/she can choose either to retweet
it or to post a new root tweet him/herself. So we cannot simply sum the retweet counts
of all root tweets on a topic; otherwise there would be double counting, or even triple
counting.
In the next section, we will look at a model useful at higher aggregation level, treating the
existing adopters (root tweets in this Twitter case) as a whole, not individually.
2.3. Bass diffusion models and extensions
2.3.1. Diffusion/Adoption process
Rogers (1962) developed the first diffusion model, and defined "diffusion of innovation"
as:
"The process by which an innovation is communicated through certain
channels over time among the members of a social system."
The adoption process of an innovation is the steps an individual goes through from the
time he hears about an innovation until final adoption (the decision to use an innovation
regularly). In the Twitter case, the adoption behavior is to tweet, retweet, or mention a
topic after seeing it.
An innovation could be any "idea, practice, or object that is perceived as new by an
individual or other unit of adoption" (Rogers, 2003). Rogers (2003) also provided four key
elements of the diffusion process - innovation, the social system which the innovation
affects, the communication channels of that social system, and time. Under this definition,
the spread of topics on Twitter can be treated as diffusion.
Figure 2.6 Innovators and Imitators (influence of mass-media communication turns
potential triers into innovators; influence of WoM turns them into imitators)
Source: Slides from Bar-Ilan University (https://faculty.biu.ac.il/~fruchta/829/lec/6.pdf)
Figure 2.7 Number of Innovators and Imitators (new adopters) over time
Adopters can be classified into innovators and imitators (Figure 2.6 and Figure 2.7).
Innovators obtain information from outside the social network, e.g. from mass-media
communication. Imitators, unlike innovators, are influenced in the timing of adoption by
the decisions of other members of the social system.
2.3.2. Basic Bass Diffusion Model
The Basic Bass Diffusion Model, or the Bass model, was first developed by Bass (1969),
and is the most widely used diffusion model. It consists of a simple differential equation
that describes the process of how new products get adopted in a population. Below is the
math formulation of the Bass model:
Number of customers who will purchase the product at time t
  = p × (Remaining Potential)                  [Innovation Effect]
  + q × (Adopters) × (Remaining Potential)     [Imitation Effect]
where
- p is the coefficient of innovation, or the coefficient of external influence,
denoting the effect of the mass media on potential triers (generation of innovators).
- q is the coefficient of imitation, or the coefficient of internal influence,
denoting the effect of the WoM on potential triers (generation of imitators).
More exactly, if we use the following notations:
- N(t) = the cumulative number of adopters of the product up to time t
- m = the total number of potential buyers of the new product = N(∞)
- n(t) = the number of customers who purchase the product at time t
Then we have the most important equation in Bass (1969):
n(t) = dN(t)/dt = p[m - N(t)] + (q/m) N(t) [m - N(t)]
Solving the above differential equation with boundary condition N(0) = 0, we get the
expression for the cumulative number of adopters:
N(t) = m (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t))
Other quantities, including the non-cumulative number of adopters n(t), the time of
peak adoption t*, and the number of adopters at the peak time n(t*), can be derived as
follows:
n(t) = dN(t)/dt = m p (p+q)^2 e^(-(p+q)t) / [p + q e^(-(p+q)t)]^2
t* = ln(q/p) / (p + q)
n(t*) = m (p + q)^2 / (4q)
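The closed-form expressions above can be checked numerically. A minimal sketch, with illustrative parameter values m = 1000, p = 0.03, q = 0.38 (not estimates from any data set):

```python
import math

def bass_cumulative(t, m, p, q):
    """Cumulative adopters N(t) under the basic Bass model."""
    e = math.exp(-(p + q) * t)
    return m * (1 - e) / (1 + (q / p) * e)

def bass_rate(t, m, p, q):
    """Non-cumulative adoptions n(t) = dN/dt."""
    e = math.exp(-(p + q) * t)
    return m * p * (p + q) ** 2 * e / (p + q * e) ** 2

m, p, q = 1000.0, 0.03, 0.38
t_star = math.log(q / p) / (p + q)     # time of peak adoption
n_peak = m * (p + q) ** 2 / (4 * q)    # adoptions at the peak
```

At t = t*, bass_rate reproduces n(t*) = m(p+q)^2/(4q), and N(t) approaches m as t grows, consistent with m = N(∞).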
The shapes of the N(t) curve and the n(t) curve are shown below in Figure 2.8.
Figure 2.8 The shape of the N(t) curve (cumulative number of adopters at time t) and
the n(t) curve (non-cumulative number of adopters at time t)
Note that the Bass model holds under several assumptions:
- The diffusion process is binary (a consumer either adopts, or waits to adopt).
- The maximum potential number of buyers (m) is a constant.
- Eventually, all m will buy the product.
- There is no repeat purchase or replacement purchase.
- The impact of the WoM is independent of adoption time.
- The innovation is considered independent of substitutes.
- The marketing strategies supporting the innovation are not explicitly included.
2.3.3. Use the Bass model to fit or predict
Bass (1969) used his model basically for fitting n(t) - N(t) function, and getting the value
of m, p, and q. Using the fitted m, p, and q, the n(t) - t curve can be plotted, as shown
in Figure 2.9 (n(t) in this case means sales in from t - 1 to t. Details about the parameter
estimation process will be discussed in Section 0.
Actual I
Predicted
Year
Figure 2.9 Actual sales and predicted sales for room air - Bass (1969)
Since the Bass model has three parameters, m, p, and q, three pairs of (n(t), N(t))
suffice to fit a curve and predict future sales. Figure 2.10 shows the forecast annual
sales curve fitted with the first three points, as provided in Bass (1969).
However, parameter estimation from the first three points is not robust. Note that
N(t) has the form of a sigmoid function, S(t) = 1/(1 + e^(-t)). If a sigmoid function is
shifted left or right, or scaled, its left side (where the first few points are located)
does not change much. As shown in Figure 2.11, there is not much difference among the
three S-curves on the left-hand side. So intuitively, the predictive power of the Bass
model is limited, especially when we fit the curve using only points before the jump of
the S-curve.
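This insensitivity is easy to verify numerically with the three sigmoid functions compared in Figure 2.11: on the left-hand side (small x) the curves are nearly indistinguishable, while past the jump they diverge widely:

```python
import math

# The three sigmoids from Figure 2.11: a base curve, a vertically scaled
# copy, and a scaled-and-shifted copy.
def s1(x): return 1.0 / (1 + math.exp(-2 * x + 10))
def s2(x): return 2.0 / (1 + math.exp(-2 * x + 10))
def s3(x): return 1.5 / (1 + math.exp(-2 * x + 14))

# Maximum pairwise gap among the curves, early vs. after the jump.
early = [max(s1(x), s2(x), s3(x)) - min(s1(x), s2(x), s3(x))
         for x in (0.0, 1.0, 2.0)]
late = max(s1(8), s2(8), s3(8)) - min(s1(8), s2(8), s3(8))
```

For x ≤ 2 the three curves differ by less than 0.01, so early observations barely constrain the parameters; at x = 8 the gap is on the order of 1.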
Figure 2.10 Predicting color TV sales based on the first three points - Bass (1969)
Sigmoid 1 = 1/(1+exp(-2x+10))
Sigmoid 2 = 2/(1+exp(-2x+10))
Sigmoid 3 = 1.5/(1+exp(-2x+14))
Figure 2.11 Comparison of different sigmoid functions
2.3.4. Bass curve shape with different p and q
In this section, we show the shape of the n(t) - t curve for different values of p and
q. When q > p, the peak time t* > 0, which means the product is successful - the
influence of WoM (q) is greater than the external influence (p). On the other hand, if
q < p, the product is unsuccessful, since the influence of WoM is smaller than the
external influence.
Figure 2.12 Shape of the n(t) curve - q > p on the left, q < p on the right
Source: Slides from Bar-Ilan University (https://faculty.biu.ac.il/~fruchtg/829/lec/6.pdf)
Figure 2.13 Shape of the n(t) curve with different values of p (innovation) and q
(contagion): low innovation (0.004) and low contagion (0.001); high innovation (0.04)
but low contagion (0.001); low innovation (0.004) but high contagion (0.004); high
innovation (0.01) and high contagion (0.004)
Source: Lecture notes from MIT Sloan
(http://www.mit.edu/hauser/Hauser%20PDFs/MIT%20Sloanware%20Notes/Note%20on%20Life%20Cycle%20Diffusion%20Models.pdf)
2.3.5. The Generalized Bass model
The Generalized Bass model makes it possible to fit a more complicated curve than the
simple S-curve. Bass et al. (1994) gave an overview of variations of the Basic Bass model.
Most generalizations can be written in the following form:
n(t) = dN(t)/dt = { p[m - N(t)] + (q/m) N(t) [m - N(t)] } x(t)
where x(t) is the current marketing effort. For example, x(t) can be price (Robinson
and Lakhani, 1975) or advertising power. Solving the above differential equation with
boundary condition N(0) = 0, we have
N(t) = m (1 - e^(-(p+q)(X(t)-X(0)))) / (1 + (q/p) e^(-(p+q)(X(t)-X(0))))
where X(t) = ∫ x(u) du, integrated from 0 to t.
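A minimal sketch of the Generalized Bass solution follows; the constant marketing effort x(t) = c used here is only an illustrative choice, under which the model reduces to the basic Bass curve on a rescaled time axis:

```python
import math

def gen_bass_cumulative(t, m, p, q, X):
    """N(t) for the Generalized Bass model, where X(t) is the cumulative
    marketing effort, X(t) = integral of x(u) du from 0 to t."""
    e = math.exp(-(p + q) * (X(t) - X(0)))
    return m * (1 - e) / (1 + (q / p) * e)

# Illustrative effort: constant x(t) = c, so X(t) = c * t. With c > 1 the
# diffusion simply runs c times faster than the basic Bass curve.
c = 1.5
X = lambda t: c * t
```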
2.4. Integrating social network into economic diffusion models
There are some previous studies focusing on integrating social network into economic
diffusion models. Although not the same as our work, they still offer insights to our
research direction.
2.4.1. Two types of Twitter topics with different predictability
Naaman et al. (2011) found by hypothesis testing that there are 2 types of Twitter topics
with significantly different features, and therefore different predictability.
- Exogenous trends: trends originating from outside of Twitter, e.g., earthquakes,
sports games
- Endogenous trends: trends originating from within Twitter, e.g., popular tweets
posted by Obama
Exogenous trends are hard to predict based only on Twitter data, since these topics are
driven purely by media outside of Twitter. On the other hand, endogenous trends are
relatively easy to predict based on Twitter data. Later in this thesis, we will show
our topic clustering results, which appear consistent with Naaman's conclusion.
2.4.2. Predict Twitter hashtag popularity by classification
Ma et al. (2013) predicted which Twitter topics will become popular one or two days
later. Twitter topics are annotated by hashtags. Ma formulated the problem as a
classification task, defining five ranges of peak-day count - [0, 4), [4, 24), [24, 44),
[44, 84), and [84, +∞) - and referred to these as "not popular," "marginally popular,"
"popular," "very popular," and "extremely popular," respectively. Based on this
definition of popularity, Ma applied 5 classification models - Naive Bayes, k-nearest
neighbors, decision trees, support vector machines, and logistic regression.
The metric that Ma used to evaluate prediction accuracy is the F1 statistic
F1 = 2 × precision × recall / (precision + recall)
where precision (also called positive predictive value) is the fraction of retrieved
instances that are relevant, and recall (also known as sensitivity) is the fraction of
relevant instances that are retrieved.
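From raw classification counts (true positives, false positives, false negatives), the F1 statistic can be computed directly:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # fraction of retrieved that are relevant
    recall = tp / (tp + fn)      # fraction of relevant that are retrieved
    return 2 * precision * recall / (precision + recall)
```

For example, with 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so F1 = 0.8.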
Table 2.2 lists the features Ma used to classify, including 7 content features extracted
from a hashtag string and the collection of tweets containing the hashtag, and 11
contextual features extracted from the social graph formed by users who have adopted
the hashtag. And Ma observed that contextual features are more effective than content
features.
Figure 2.14 Precision and Recall (source: Wikipedia)
Table 2.2 Features used in Ma et al. (2013) - 7 content features and 11 contextual
features

Content features:
- ContainingDigits: binary attribute checking whether a hashtag contains digits
- SegWordNum: number of segment words from a hashtag
- URLFrac: fraction of tweets containing a URL
- SentimentVector: 3-dimension vector of the ratios of neutral, positive, and negative
tweets
- TopicVector: 20-dimension topic distribution vector derived using a topic model
- HashtagClarity: Kullback-Leibler divergence of word distribution between tweets
containing the hashtag and the whole tweet collection
- SegWordClarity: Kullback-Leibler divergence of word distribution between tweets
containing any segment word of the hashtag and the whole tweet collection

Contextual features:
- UserCount: number of users who adopted the hashtag
- TweetsNum: number of tweets containing the hashtag
- ReplyFrac: fraction of tweets containing a mention @
- RetweetFrac: fraction of tweets containing RT
- AveAuthority: average authority of users in the adopter graph
- TriangleFrac: fraction of users forming triangles in the adopter graph
- GraphDensity: density of the adopter graph
- ComponentRatio: ratio between the number of connected components and the number of
nodes in the adopter graph
- AveEdgeStrength: average edge weight in the adopter graph
- BorderUserCount: number of border users
- ExposureVector: 15-dimension vector of exposure probabilities P(k)
2.4.3. Applying Diffusion Models to Predict the Number of Users of Social Network
Websites
The literature discussed in this section applied diffusion models to whole social
network websites, instead of to a certain topic. Chang (2010) suggested applying
diffusion models to the study of Twitter hashtag adoption, and later Wang (2011)
completed the research, applying the Bass model to the Twitter diffusion process.
Predicted results are shown in Figure 2.15.
Figure 2.15 Twitter Cumulative Adoptions: 5 Years since Introduction - True
(left) vs Fitted (right) - Wang (2011)
In addition to Twitter cumulative adoptions, Bauckhage et al. (2014) investigated patterns
of adoption of 175 social media services and web businesses, and predicted Google search
frequencies for these social media websites. From Google Trends, Bauckhage collected
search frequency of queries such as "Twitter," "eBay," "Facebook," or "YouTube" that
indicate interest in social media services. Bauckhage applied several economic
diffusion models, including the Gompertz model, the Bass model, and the Weibull model.
The results showed that these diffusion models provide accurate and statistically
significant fits to the data, with the Gompertz model performing best of the three,
and that collective attention to social media grows and subsides in a highly regular
manner.
Figure 2.16 shows the curves fitted and predicted by the three diffusion models for the
keywords "YouTube" and "Twitter." Solid lines fit the observed Google Trends values,
and dotted lines predict the future Google search frequency for "YouTube" and
"Twitter."
Figure 2.16 Predicting future Google search frequency for "YouTube" (left) and
"Twitter" (right) with the Weibull, Bass, and shifted Gompertz models - Bauckhage et
al. (2014)
2.4.4. Predicting hashtag/keyword popularity when popularity is defined
continuously
Unlike the model discussed in Section 2.4.2, when popularity is defined continuously,
classification models no longer apply. Wang (2011) defined the popularity of a hashtag
as the percentage of Twitter trending topics denoted by hashtags such as "#JeremyLin"
(a topic can also be denoted by keywords such as "Jeremy Lin"). Wang then forecasted
this percentage using the Simple Logistic Growth Model
y(t) = L / (1 + a e^(-bt))
where y(t) is the market share (or adoption rate), L refers to the saturation level,
and a and b describe the curve.
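The Simple Logistic Growth Model is straightforward to evaluate; the parameter values in the check below are arbitrary:

```python
import math

def logistic_growth(t, L, a, b):
    """Simple Logistic Growth Model: y(t) = L / (1 + a * exp(-b * t))."""
    return L / (1 + a * math.exp(-b * t))
```

At t = 0 the curve starts at L / (1 + a), and as t grows it saturates at L.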
Hashtag popularity can also be defined as the online mention count of a Twitter hashtag
or a news keyword, as proposed by Yang and Leskovec (2010). They collected Twitter
hashtags, and phrases in blogs and news, from the five types of data sources shown in
Table 2.3. Using the Linear Influence Model, Yang and Leskovec predicted the mention
count of a certain hashtag in the next day. The Linear Influence Model models the
volume of diffusion over time as a sum of influences of nodes that got "infected"
beforehand. For each node (u, v, or w in Figure 2.17), an influence function is
estimated and used to quantify how many subsequent infections can be attributed to the
influence of that node over time. The mention count in the next day is then the sum of
all the influence functions. An influence function could be, for example, an
exponential function I_u(t) = c_u e^(-α_u t).
Table 2.3 Five Data Sources for Wang (2011)

Type               Websites
Newspaper          nytimes.com, online.wsj.com, washingtonpost.com, usatoday.com,
                   boston.com
Professional blog  huffingtonpost.com, salon.com
TV                 cbs.com, abc.com
News Agency        reuters.com, ap.org
Blogs              wikio.com, forum.prisonplanet.com, blog.taragana.com,
                   freerepublic.com, gather.com, blog.myspace.com,
                   leftword.blogdig.net, bulletin.aarp.org, forums.hannity.com,
                   wikio.co.uk, instablogs.com
Figure 2.17 The Linear Influence Model - the volume of diffusion over time is modeled
as a sum of influences of nodes (u, v, w) that got "infected" beforehand
Figure 2.18 Influence functions across different fields - (a) Politics, (b) Nation,
(c) Entertainment, (d) Business, (e) Technology, (f) Sports. Each panel shows the
average influence functions of five types of websites - Newspapers (News), Professional
Blogs (PB), Television (TV), News Agencies (Agency), and Personal Blogs (Blog) - with
the number in brackets denoting the total influence of a media type.
3. Data Overview
In this section, we will give an overview of the data used in this research, including the
data source, data size, raw data format, and data processing which is necessary before
modeling.
3.1. Definitions
The concepts used in this research are listed below:
- Topic: specified by a hashtag (e.g. #WorldCup) or keywords (e.g. Jeremy Lin).
- Tweet: text posted on Twitter by its users. A tweet is either a root tweet, a
retweet, or a reply.
- Mention count: number of tweets mentioning the topic, i.e., including the hashtag or
keywords in their text.
- Root tweet: original tweet created and posted by the user.
- Retweet: tweet forwarded, not originally created.
- Reply: reply under tweets posted by other users.
- Peak day: the day on which the topic "exploded" on Twitter, with the highest daily
mention count over the time horizon.
- Peak count: mention count on the peak day, for a certain topic.
- Influencer (Influential User): each Twitter user has an influence score calculated by
Topsy. It measures the likelihood that, each time the user says something, people will
pay attention. Influence for Twitter users is computed using all historical retweets.
"Influential" tags appear for the top 0.5% most influential Twitter users.
- Adopters (of a certain topic): users who have already tweeted about the topic.
3.2. Data source - Topsy API
Data in this thesis were collected using Topsy API (Application Programming Interface).
Topsy ( http://topsy.com/ ) maintains all tweets since Twitter's inception in 2006. In
addition to tweets as text, Topsy transforms the raw tweets and users' information into
numbers that are easier to use, such as number of tweets mentioning a certain keyword
in a certain period of time, number of influential authors who talk about a given term.
Figure 3.1 shows the homepage of Topsy.com, where you can type in the keywords you
are interested in, e.g. "iWatch." And then the tweets per day on the keywords will be
shown as a graph (Figure 3.2). It is even available to compare tweets on different
keywords on the same graph (Figure 3.3).
Figure 3.1 Topsy Homepage
Figure 3.2 An Example of Using Topsy - Tweets on "iWatch" per day
Figure 3.3 Tweet Count on "iWatch" vs. "iPhone 6S"
Table 3.1 Topsy APIs

Content APIs:
- Tweets: top tweets for a set of terms and filters
- Bulk Tweets: tweets in bulk for a set of terms and filters
- Streaming Tweets: stream of tweets matching a set of terms and filters
- Photos: top photos for a set of terms and filters
- Links: top links for a set of terms and filters
- Videos: top videos for a set of terms and filters
- Citations: time-ordered tweets or retweets referencing a tweet, link, photo, or video
- Conversation: reply thread for a tweet
- Tweet: look up tweets by tweet ID
- Validate: check whether a tweet is still valid (has not been deleted by its user) by
tweet ID
- Location: set of places that start with a given string, to be used with /metrics/geo

Metrics APIs:
- Mentions: number of tweet mentions by time slice for any term
- Citations: number of total citations (tweets, retweets, and replies) for a particular
URL
- Impressions: number of potential impressions by time slice for any term
- Sentiment: Topsy Sentiment Score (0-100) by time slice for any term
- Geo Distribution: number of mentions by country, state/province, county, or city

Insights APIs:
- Related Terms: phrases, hashtags, terms, or authors co-mentioned with a given term
- Influencers: influential authors who talk about and amplify a given term
- Author Info: information about a Twitter account

Source: http://api.topsy.com/doc/
Data can not only be viewed on Topsy's website but can also be collected in batch
("scraped") via the Topsy API. Table 3.1 lists the information that can be obtained
through the Topsy API.
In our research, we applied one of the above three API categories - the Metrics APIs -
to collect the tweet count on a certain topic every hour.
On the basis of these APIs, Topsy also provides filter parameters (Table 3.2) which
make it possible to collect tweets by time, location, language, or tweet type (root
tweet, retweet, or reply).
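As a sketch of what such a Metrics API request might look like - the endpoint path and the "apikey" parameter name are illustrative assumptions, since Topsy's service has since been shut down and only the filter parameter names from Table 3.2 are documented here:

```python
from urllib.parse import urlencode

def build_mentions_url(term, mintime, maxtime, apikey="YOUR_KEY"):
    """Build a (hypothetical) Metrics API URL for hourly mention counts.

    The mintime/maxtime filter parameters are Unix timestamps, as
    described in Table 3.2; the path and apikey parameter are assumed.
    """
    params = {
        "q": term,
        "mintime": mintime,
        "maxtime": maxtime,
        "apikey": apikey,
    }
    return "http://api.topsy.com/v2/metrics/mentions.json?" + urlencode(params)
```

A scraping loop would call such a URL once per topic and time window, then parse the returned counts into the hourly table of Section 3.4.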
Table 3.2 Filter parameters of Topsy API

- mintime: start time for the report, in Unix timestamp format.
- maxtime: end time for the report, in Unix timestamp format.
- region: show results for the specified locations only. A valid region integer ID
must be used.
- latlong: show results only from tweets that are geotagged with latitude/longitude
coordinates.
- allow_lang: show results in the specified language only. Currently supports 'en'
(English), 'zh' (Chinese), 'ja' (Japanese), 'ko' (Korean), 'ru' (Russian), 'es'
(Spanish), 'fr' (French), 'de' (German), 'pt' (Portuguese), and 'tr' (Turkish).
- sentiment: show results with the specified sentiment only. Valid values are 'pos'
(Positive), 'neu' (Neutral), or 'neg' (Negative).
- infonly: show results from influential users only.
- tweet_types: show results only of the specified tweet type. Supported values are:
'tweet', 'reply', 'retweet'.

Source: http://api.topsy.com/doc/filter-parameters/
3.3. Data size - number of topics and time horizon of each topic
We collected data on 220 topics covering a wide range of fields, from 2009 to 2014. The
list of topics is shown in Appendix A, with the peak day and peak count of each topic.
Twitter publishes its trending (popular) topics every year on its website, e.g.
- 2012 Twitter trending topics: https://2012.twitter.com/en/trends.html
- 2013 Twitter trending topics: https://2013.twitter.com/#category-2013
- 2014 Twitter trending topics: https://2014.twitter.com/moments
For each topic, there is a peak day with the highest daily mention count. The time
horizon of each topic is set to 180 days, about 6 months: from 120 days before the peak
day to 60 days after it. For example, Figure 3.4 shows the time horizon of Topic 89 -
iWatch. We first detected that the mention count of "iWatch" reached a peak on Sep 9th,
2014; the time horizon for this topic was then determined as May 13th to Nov 8th, 2014.
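The window computation is simple date arithmetic. Note that the peak day itself counts as day 120 of the 180-day window, so the window starts 119 calendar days before the peak, which reproduces the May 13 - Nov 8 example:

```python
from datetime import date, timedelta

def topic_horizon(peak_day, days_before=119, days_after=60):
    """180-day observation window with the peak on day 120: the window
    runs from 119 days before the peak to 60 days after it, inclusive."""
    return (peak_day - timedelta(days=days_before),
            peak_day + timedelta(days=days_after))

# Topic 89 (iWatch): peak detected on 2014-09-09.
start, end = topic_horizon(date(2014, 9, 9))
```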
Figure 3.4 Time Horizon and Mention Count of Topic 89 - iWatch (non-cumulative and
cumulative mention counts; peak day: 2014/9/9)
3.4. Raw data and its format
The raw data collected via the Topsy API consist of two parts. The first part has a
time unit of one hour and includes the variables in Table 3.3 for each topic. Table 3.4
shows the part-1 data for topic 89, which contains the 7 variables for every hour of
this topic's time horizon.
The second part of the raw data was collected every 5 days and consists of three
sheets, shown in Table 3.5, Table 3.6, and Table 3.7. In each of the three sheets, each
row represents a topic and each column represents a variable over a five-day window.
There are 180 days / 5 days = 36 columns in each sheet. For example, in Table 3.5, the
number in the 1st row, 1st column is the count of influencers mentioning topic 1 in
days 1-5.
Table 3.3 Raw Data Part 1 (collected by hour) - Variables for each topic

- Unix timestamp: denotes the start of the hour.
- Mention count: mention count of a topic per hour; equals the sum of the root tweet,
retweet, and reply counts.
- Root tweet count
- Retweet count
- Reply count
- Mention count by influential users only: each Twitter user has an influence score
calculated by Topsy. It measures the likelihood that, each time the user says
something, people will pay attention. Influence for Twitter users is computed using all
historical retweets. "Influential" tags appear for the top 0.5% most influential
Twitter users.
- Sentiment score: ranging from 0 to 100, where 0 means negative and 100 means
positive. Determined by the Topsy API.
- Impressions: number of users reached by this topic (these users have Twitter feeds
including this topic).
Table 3.4 Raw Data Part 1 - Topic 89 (iWatch) as an Example

Unix timestamp | Mention count | Root tweet count | Retweet count | Reply count | Mention count by influential users only | Sentiment score | Impressions
1399917600 | 38 | 32 | 3 | 3 | 1 | 60 | 14181
1399921200 | 42 | 36 | 5 | 1 | 1 | 92 | 44378
1399924800 | 25 | 18 | 5 | 2 | 1 | 76 | 20994
... | ... | ... | ... | ... | ... | ... | ...
1415462400 | 31 | 25 | 6 | 0 | 3 | 70 | 22729
1415466000 | 7 | 6 | 0 | 1 | 0 | 32 | 7030
Table 3.5 Raw Data Part 2 - Sheet A: Count of influencers mentioning the topic

Topic ID | Day 1~5 | Day 6~10 | ... | Day 171~175 | Day 176~180
1 | 1000 | 1000 | ... | 997 | 999
2 | 0 | 0 | ... | 999 | 999
3 | 999 | 1000 | ... | 1000 | 998
... | ... | ... | ... | ... | ...
219 | 485 | 284 | ... | 332 | 268
220 | 909 | 1000 | ... | 1000 | 1000
Table 3.6 Raw Data Part 2 - Sheet B: Average follower count of influencers mentioning the topic

Topic ID | Day 1~5 | Day 6~10 | ... | Day 171~175 | Day 176~180
1 | 6290.815 | 15039.53 | ... | 26423.54 | 1379.78
2 | . | . | ... | 6383.804 | 3913.21
3 | 4565.331 | 2849.977 | ... | 5527.489 | 22899.03
... | ... | ... | ... | ... | ...
219 | 21519.15 | 1413.029 | ... | 21519.15 | 1413.029
220 | 31357.66 | 17697.21 | ... | 31357.66 | 17697.21
Table 3.7 Raw Data Part 2 - Sheet C: Sum of the influential levels of all influencers

Topic ID | Day 1~5 | Day 6~10 | ... | Day 171~175 | Day 176~180
1 | 7250 | 7503 | ... | 7289 | 7670
2 | 0 | 0 | ... | 1356 | 1398
3 | 5438 | 5213 | ... | 6405 | 6313
... | ... | ... | ... | ... | ...
219 | 448 | 424 | ... | 875 | 903
220 | 611 | 910 | ... | 792 | 548
3.5. Data processing before modeling
There are missing values in the second part of the raw data, shown in the above three
tables with a dot ".". For example, in Table 3.6 (Sheet B), topic 2 has no follower-count
information for the first 10 days. We simply treated these missing values as 0 in our
model, which is reasonable because in this case a missing value means no influencers were
talking about topic 2 in those days, and therefore the follower count is 0.
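The zero-imputation described above can be sketched with pandas (a minimal illustration; the sheet layout and values here are stand-ins, not the original data pipeline):

```python
import numpy as np
import pandas as pd

# Sheet B layout: one row per topic, one column per 5-day window.
# Missing cells ("." in the raw sheets) become NaN after loading.
sheet_b = pd.DataFrame(
    [[6290.8, 15039.5], [np.nan, np.nan], [4565.3, 2850.0]],
    index=[1, 2, 3], columns=["day_1_5", "day_6_10"],
)

# Treat missing follower counts as 0: no influencers mentioning the topic
# in that window means 0 followers were reached.
sheet_b = sheet_b.fillna(0.0)
```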
The next step of data processing is to create features (independent variables) and
dependent variables for machine learning, which will be introduced in Section 4.1.
3.6. Other data sources providing Twitter data
This section introduces other data sources providing Twitter data that readers might
be interested in, although they were not used in this research.
Similar to Topsy, Keyhole (http://keyhole.co/) is another website on which we can search
tweets on certain topics, broken down by location, users' gender, and many other filters.
Figure 3.5 is the page returned by Keyhole for the keyword "iWatch."
Twitter itself has an API (https://dev.twitter.com/overview/documentation), offering
information related to its users, tweets, hashtags, and locations.
Figure 3.5 Keyhole's Results for "iWatch"
4. Models
In this section, we introduce the two types of models applied to predict Twitter topic
popularity: machine learning regression models and the Basic Bass model.
4.1. Dependent and Independent Variables in Machine Learning Models
The dependent variable in our models is the total mention count within 180 days, from 120
days before the peak day until 60 days after the peak day. Consistent with the notation in
the Bass model introduction (Section 2.3), let N(180) denote the dependent variable.
Figure 4.1 shows the wide distribution of the total mention count. There are 97 topics
with a total mention count between 10^5 and 10^6, which is the mode range.
Figure 4.1 Distribution of the total mention count (220 topics)
The task of our models is to predict the dependent variable based on independent variables
(features) calculated from data within the first x days (0 < x ≤ 180). We have 12
features, and we use f_i(x) to denote the i-th feature calculated from data within the
first x days. The meaning of each feature is shown in Table 4.1. For example, f_1(15) is the
average mention count per hour within the first 15 days. Note that all features are
normalized to [0, 1] before model estimation.
In our models, we denote the set of possible values as Firstdays = {15, 30, 45, 60, 75, 90,
100, 110, 120, 130, 140, 150, 165, 180}, and x can take any value in Firstdays.
Intuitively, larger x leads to better prediction because the model knows more with larger
x.
We have two ways of using the 12 features. Take the 1st feature, average mention count
per hour, as an example, with x = 45 days:
- Not merge: use the average mention count per hour in the first 45 days
(f_1(45)). Then there are 12 features in the model; or
- Merge: use the average mention count per hour in the first 15 days, 30 days, and 45 days
(f_1(15), f_1(30), and f_1(45)), i.e., merge all features before x. Then there are
12 * 3 = 36 features in one model.
Table 4.1 Independent variables (features)

f_1(x): Average mention count per hour
f_2(x): Growth rate of mention count: OLS slope of the mention-time line
f_3(x): % of root tweets = root tweet count in first x days / mention count in first x days
f_4(x): % of retweets
f_5(x): Average mention count per hour by influential users only
f_6(x): Average impressions per hour
f_7(x): Average sentiment score
f_8(x): Extreme degree of sentiment = avg(|sentiment score(t) - 50|)
f_9(x): Growth rate of sentiment score: OLS slope of the sentiment-time line
f_10(x): Average influencer count (max limit = 1000) per 5 days
f_11(x): Average follower count per influencer
f_12(x): Sum of influencers' influence levels
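As a minimal sketch of how the first two features might be computed from the hourly mention series (function and variable names are my own, not from the thesis):

```python
import numpy as np

def mention_features(mentions_per_hour, x_days):
    """Compute f_1 (average mention count per hour) and f_2 (OLS slope of
    the mention-time line) from the first x_days of an hourly series."""
    y = np.asarray(mentions_per_hour[: x_days * 24], dtype=float)
    t = np.arange(len(y), dtype=float)   # time index in hours
    f1 = y.mean()                        # average mentions per hour
    f2 = np.polyfit(t, y, 1)[0]          # OLS slope via a degree-1 fit
    return f1, f2

# Example: a series that grows by 2 mentions every hour over 2 days.
series = 2.0 * np.arange(48)
f1, f2 = mention_features(series, 2)
```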
"I.2
MErics used to evaluate nmodels
To evaluate the forecasting accuracy of our modes, we have two choices of metrics:
1) Mean relative error = I(Forecast - True)/True I
2) Mean squared error = (Forecast - True)A2
Considering the wide distribution of the total mention count and the huge difference
between mention counts of different topics, relative error is preferred, while MSE is too
sensitive to topics with large mention count. So we define predicting error as the mean
relative error, I (Forecast - True)/True I averaged over all topics.
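The chosen metric is a one-liner (names are mine):

```python
import numpy as np

def mean_relative_error(forecast, true):
    """Mean of |(forecast - true) / true| over all topics."""
    forecast, true = np.asarray(forecast, float), np.asarray(true, float)
    return np.mean(np.abs((forecast - true) / true))

# A forecast 10% high on one topic and 10% low on another averages to 0.1.
err = mean_relative_error([110.0, 90.0], [100.0, 100.0])
```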
4.3. Machine Learning Regression Models
The 220 topics of data are separated into a training set of 184 topics and a testing set of
36 topics. Roughly 1/6 of data are used for testing. In this section, we will introduce the
four machine learning models we used to predict the total mention count of a Twitter
topic. They are k-nearest-neighbor, linear regression, regression tree, and ensemble.
4.3.1. K-nearest-neighbor with feature selection
The first machine learning model is k-nearest-neighbor (KNN). We applied 4 variations of
KNN, as shown in Table 4.2, all of which apply the following method: for each x, search
for the best combination of K and feature subset. The 4 KNN variations use features in
the two ways described in Section 4.1, and use two model templates to forecast a testing
point, given the K nearest neighboring points:
- Template 1: KNN - average of all K neighbors' total mention counts
- Template 2: Weighted KNN - neighbors' mention counts weighted by 1/distance
Table 4.2 Variations of KNN

Model ID | Machine Learning Model Template | The way of using features
1 | KNN | Not merged
2 | Weighted KNN | Not merged
3 | KNN | Merged
4 | Weighted KNN | Merged
We traverse K from 1 to the rule-of-thumb value (= sqrt(training set size) ≈ 14). For each K, we
search for the optimal feature subset and calculate the corresponding relative error.
Finally, we pick the K and the corresponding feature subset with the smallest relative error.
Feature subset selection was implemented with the sequential feature selection algorithm
(Matlab function "sequentialfs"), in order to avoid overfitting with too many features.
The sequential feature selection method has two components:
- An objective function, called the criterion, which the method seeks to minimize over
all feasible feature subsets. In our case, the criterion is the relative error averaged over
topics. For each candidate feature subset, the method performs 5-fold cross-validation
to calculate the criterion.
- A sequential search algorithm, which adds or removes features from a candidate
subset while evaluating the criterion. Since an exhaustive comparison of the criterion
value over all 2^n subsets of an n-feature data set is typically infeasible, sequential
searches move in only one direction, always growing or always shrinking the
candidate set.
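A rough analogue of this search in Python (the thesis used Matlab's sequentialfs; scikit-learn's SequentialFeatureSelector is a substitute, and the data here is synthetic):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import make_scorer
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((184, 12))                        # 184 training topics, 12 features
y = 1e5 * (1.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1])  # target driven by features 0 and 1

# Criterion: mean relative error, to be minimized (so greater_is_better=False).
rel_err = make_scorer(
    lambda y_true, y_pred: np.mean(np.abs((y_pred - y_true) / y_true)),
    greater_is_better=False,
)

# One iteration of the search: fix K and grow a feature subset forward,
# scoring each candidate subset by 5-fold cross-validated relative error.
# (The thesis repeats this for every K from 1 to ~14 and keeps the best pair.)
sfs = SequentialFeatureSelector(
    KNeighborsRegressor(n_neighbors=5),
    n_features_to_select=3, direction="forward", cv=5, scoring=rel_err,
)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```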
4.3.2. Linear regression with feature selection
The 2nd machine learning model is linear regression with feature selection. Table 4.3
shows the two variations of linear regression models.
Table 4.3 Variations of Linear Regression models

Model ID | Machine Learning Model Template | The way of using features
5 | Linear regression | Not merged
6 | Linear regression | Merged
To avoid overfitting with too many features, feature selection was implemented with a
stepwise feature selection algorithm: a feature is added when its coefficient has a p-value
below 0.05, and removed when its coefficient has a p-value above 0.10. At each step, at
most one feature is added and at most one feature is removed.
4.3.3. Bagged regression tree
The 3rd machine learning model is the bagged regression tree model. We have two variations
of tree models, shown in Table 4.4. Bagging is applied to obtain a more robust and stable
tree model. Bagging, also called "bootstrap aggregation," is a type of ensemble
learning. The bagging algorithm generates many bootstrap replicas of the dataset and grows
a decision tree on each replica. Each bootstrap replica is obtained by randomly selecting
N observations out of N with replacement, where N is the dataset size. To find the
predicted response of a bagged tree, take an average over the predictions from individual
trees. In our case, we bagged 10 weak learners (regression trees).
Table 4.4 Variations of Tree models

Model ID | Machine Learning Model Template | The way of using features
7 | Bagged tree | Not merged
8 | Bagged tree | Merged
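The bagging procedure described above can be sketched as follows (synthetic stand-in data; scikit-learn's BaggingRegressor replaces the Matlab implementation used in the thesis):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.random((184, 12))            # stand-in training features
y = 1e5 * (1.0 + 3.0 * X[:, 0])      # stand-in total mention counts

# 10 regression trees, each grown on a bootstrap replica (184 samples drawn
# out of 184 with replacement); the bagged prediction averages the 10 trees.
bag = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=10, bootstrap=True, random_state=0
).fit(X, y)
pred = bag.predict(X[:5])
```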
4.3.4. Ensemble model
Different models outperform at different time stages, as will be shown in Section 5.
To utilize the advantages of all models, and to mitigate error peaks and fluctuations, the last
machine learning model, the ensemble model, takes the average over the KNN average, the LR
average, and the Tree average as its forecast value, where
- KNN average = average forecast value over all 4 KNN variations
- LR average = average forecast value over the 2 LR models
- Tree average = average forecast value over the 2 tree models
4.4. The Basic Bass model
4.4.1. Model formulation
Section 2.3 introduced the idea and form of the Bass model and its extensions. With our
Twitter data, we fit the Basic Bass model, specifically the following equation:

N(t) = m * (1 - e^(-(p+q)t)) / (1 + (q/p) * e^(-(p+q)t))

where
- N(t): cumulative number of tweets mentioning the topic up to time t
- m = N(∞): total number of tweets when time goes to infinity
- p: the coefficient of innovation, or the coefficient of external influence, denoting the
effect of the mass media (outside Twitter) on potential adopters
- q: the coefficient of imitation, or the coefficient of internal influence, denoting the
effect of the existing tweets on the topic
Take x = 45 as an example: at day 45, we have a list of (t, N(t)) pairs for t ≤ 45.
This list is used to fit the N(t)-t curve and obtain the estimated m, p, and q. Figure 4.2
takes topic 4 as an example, showing the true (solid line) and fitted (dashed line) curves of N(t),
the cumulative mention count.
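Fitting m̂, p̂, q̂ from the (t, N(t)) pairs can be sketched with nonlinear least squares (a minimal illustration on synthetic data; the thesis's own fitting routine may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def bass_cumulative(t, m, p, q):
    """Basic Bass model: cumulative mentions N(t)."""
    e = np.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

# Synthetic topic: m = 1e5 total mentions, p = 0.01, q = 0.1 (per day).
t = np.arange(1.0, 46.0)                # data from the first x = 45 days
N = bass_cumulative(t, 1e5, 0.01, 0.1)

# Estimate (m, p, q) from the first-45-day curve, then extrapolate.
(m_hat, p_hat, q_hat), _ = curve_fit(
    bass_cumulative, t, N,
    p0=[N[-1] * 2, 0.03, 0.3],
    bounds=([1.0, 1e-6, 1e-6], [1e8, 1.0, 1.0]),
)
N180_hat = bass_cumulative(180.0, m_hat, p_hat, q_hat)
```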
Figure 4.2 True vs fitted cumulative mention count curve (Topic 4 - #ChiefKeefMakesMusicFor, 180 days)
4.4.2. Three time aggregation levels - Fitting error and Forecasting error
We tried three time aggregation levels: mention count aggregated to count per hour,
count per day, and count per 5 days. The aggregation level is denoted by intv and takes
three values: 1 hr (1/24 day), 1 day, and 5 days. Take the aggregation level of 1 day as an
example: we feed the model (t, N(t)) pairs for t = 1 day, 2 days, ..., 180 days. With all
data (x = 180 days) fed to the model, a smaller aggregation level means more points on the
curve are fed to the model, therefore leading to better fitting accuracy (Figure 4.3),
where the fitting error (MSE) is defined as

fitting error = (1 / (180/intv)) * Σ_{t = intv, 2*intv, ..., 180} (N(t) - N̂(t))^2

When x < 180 days, during the fitting process for the optimal m, p, and q, the objective
function to minimize is

fitting error = (1 / (x/intv)) * Σ_{t = intv, 2*intv, ..., x} (N(t) - N̂(t))^2
Figure 4.3 Compare fitted curves with aggregation levels of 1 hr, 1 day, and 5 days (Topic 4 - #ChiefKeefMakesMusicFor, 180 days)
Note that our task is not fitting but predicting the total mention count within 180 days
based on data within the first x days. Recall that the predicting error is defined as the mean
relative error, |(Forecast - True)/True| averaged over all topics, and the fitting error only
makes sense when x = 180 days. When x < 180, say x = 90, the Bass model fits the first
half of the N(t) curve. Although smaller granularity leads to smaller fitting error for the
first half of the curve, when the curve is extended to x = 180, the predicting error might be bigger.
Smaller granularity leads to violent fluctuation, and when the model tries to fit every
point on the fluctuating curve and extends the curve to day 180, the resulting estimate
of the total mention count might be far from the true value. Later, in Section 5.2, we will
compare the predicting errors of the three time aggregation levels.
4.4.3. Define the predicted cumulative mention count
The true value of our dependent variable is N(180), the cumulative mention count within
180 days. With the fitted parameters m̂, p̂, q̂, we can predict the cumulative mention count.
Define the fitted curve N̂(t) as

N̂(t) = m̂ * (1 - e^(-(p̂+q̂)t)) / (1 + (q̂/p̂) * e^(-(p̂+q̂)t))

We have two choices, N̂(180) or m̂, as the predicted cumulative mention count within
180 days. m (= N(∞)) is the total mention count when time goes to infinity. If we assume
the horizon of 180 days covers the whole diffusion process, then m̂ can be a
predicted value for the total mention count. Later, in Section 5.2.1, we will discuss which
of N̂(180) and m̂ is the better predicted value.
4.5. Modeling on a subset of topics
Besides running our models on all 220 topics, we also model on a subset of topics,
since some topics are exogenous trends (Section 2.4.1) originating from outside of Twitter,
e.g., an earthquake. These topics are fundamentally unpredictable with only Twitter
data.
Figure 4.4 plots the non-cumulative mention count every 5 days for topic 26 and topic 89.
The peak (day 120) of topic 26, "pandaAi," was generated by news about a panda named Ai
faking pregnancy. There were no tweets on this topic before the news release, and
therefore its total mention count cannot be forecasted without news information from
outside Twitter. Topic 89, "iWatch," was announced by Apple's CEO Tim Cook on
September 9, 2014, which is consistent with the peak day for this topic on Twitter. But
before the peak day, there were already "rumors" about the release of the iWatch, explaining
the fluctuation in the mention count curve at early stages. Before the peak day, the mention
count of topic 26 seems harder to predict than that of topic 89. That is what motivates
us to find the subset of topics that are easier to predict.
In this section, we will try to detect which topics are exogenous trends, and then in Section
5.4, we will model on endogenous trends only, and see if we can improve the predictive
accuracy.
Figure 4.4 Non-cumulative mention count of topics with/without fluctuation before peak day (Topic 26 - #pandaAi; Topic 89 - iWatch, peak day September 9, 2014)
4.5.1. Determine the number of clusters
We will separate the group of exogenous trends and the group of endogenous trends by
clustering. Intuitively, exogenous trends have less fluctuation before the peak day (topic 26
in Figure 4.4), and endogenous trends have more (topic 89 in Figure 4.4). So clustering
is applied to the vector of mention counts every 5 days, which represents the
diffusion pattern of each topic, shown in Table 4.5. Each row represents a topic (220 rows),
and each column represents the mention count within a certain 5 days as a percentage of total
mentions in 180 days (180/5 = 36 columns).
Table 4.5 Percentage mention count every 5 days (vector being clustered)

Time range | Day 1-5 | Day 6-10 | Day 11-15 | ... | Day 166-170 | Day 171-175 | Day 176-180
Topic 1 | 0.534% | 0.153% | 0.109% | ... | 1.897% | 0.894% | 1.199%
Topic 2 | 0.498% | 0.299% | 0.149% | ... | 11.211% | 9.367% | 5.182%
Topic 3 | 0.000% | 0.000% | 0.000% | ... | 0.000% | 0.000% | 0.000%
Topic 4 | 0.000% | 0.001% | 0.000% | ... | 0.193% | 0.120% | 0.249%
... | ... | ... | ... | ... | ... | ... | ...
Topic 220 | 0.082% | 0.097% | 0.375% | ... | 0.084% | 0.110% | 0.113%
To detect the natural underlying number of clusters (number of diffusion patterns), we
apply 5 different methods to determine how many clusters we should have.
The 1st method applies Principal Component Analysis to reduce the dimension of the
percentage mention count vector from 36 to 2 or 3, for visualization. In Table 4.6, the
principal component variances are the eigenvalues of the covariance matrix of the 36-dim
mention count vector. Dividing each variance by the sum of all 36 variances, we get that
- The first 2 principal components capture 83.1% of total variance
- The first 3: 87.9%
- The first 6: 94.6%
- The first 10: 97.1%
- The first 18: 99.2%
The first 2 and first 3 principal components are then plotted in Figure 4.5 and Figure
4.6. Visually, there is not much insight into the underlying number of clusters.
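The explained-variance computation behind Table 4.6 can be sketched as follows (random stand-in data; with the real 220 x 36 matrix the cumulative shares would match the percentages above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Stand-in for the 220 x 36 matrix of percentage mention counts per 5 days.
X = rng.random((220, 36))
X = X / X.sum(axis=1, keepdims=True)   # each row sums to 1 (percentages)

pca = PCA().fit(X)
explained = np.cumsum(pca.explained_variance_ratio_)
# explained[k-1] is the share of variance captured by the first k components.
```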
The 2nd method is the Elbow method, a heuristic one. Run K-means for different K, and
examine the within-cluster dissimilarity (the average within-cluster point-to-centroid
distance) as a function of K. At some value of K the cost drops dramatically, and after
that it reaches a plateau (Figure 4.7); this is the optimal K value. The rationale behind the
Elbow method is that, beyond the optimal K consistent with the natural clustering, any new
cluster is very near some existing one. The limitation of the Elbow method is that the "elbow"
cannot always be unambiguously identified: sometimes there is no elbow, or there are several
elbows (Figure 4.8).
With our Twitter data, we define the cost function as the average within-cluster point-to-centroid
distance. Results are plotted in Figure 4.9. The optimal K is probably 2.
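The Elbow computation can be sketched as follows (two synthetic well-separated groups stand in for the 220 topic vectors):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two well-separated blobs as a stand-in for the 220 x 36 topic vectors.
X = np.vstack([rng.normal(0, 0.05, (110, 36)), rng.normal(1, 0.05, (110, 36))])

costs = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Cost: average point-to-centroid distance within clusters.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    costs.append(dists.mean())
# The "elbow" is where the cost stops dropping sharply; here at K = 2.
```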
Table 4.6 Principal components of mention count vector

Principal Component | Variance | Principal Component | Variance
1 | 0.09103 | 19 | 0.000159
2 | 0.023717 | 20 | 0.000138
3 | 0.006545 | 21 | 0.000133
4 | 0.004156 | 22 | 0.000115
5 | 0.003485 | 23 | 0.000102
6 | 0.00168 | 24 | 7.55E-05
7 | 0.001163 | 25 | 7.33E-05
8 | 0.000984 | 26 | 6.17E-05
9 | 0.000649 | 27 | 5.23E-05
10 | 0.000615 | 28 | 4.95E-05
11 | 0.000524 | 29 | 4.45E-05
12 | 0.000457 | 30 | 3.57E-05
13 | 0.000439 | 31 | 2.75E-05
14 | 0.000406 | 32 | 2.45E-05
15 | 0.000336 | 33 | 1.95E-05
16 | 0.000302 | 34 | 1.77E-05
17 | 0.000276 | 35 | 1.33E-05
18 | 0.000179 | 36 | 1.11E-33
Figure 4.5 Plot the First 2 Principal Components
52
0.6
02
0.4
-0.2
00.22
.-
0
0
0,
00
dC-0.2
20.2
.2nd Principal Component
ItPicplCrpnn
1st Prncipal Component
Figure 4.6 Plot the First 3 Principal Components
Figure 4.7 Illustration of Elbow method
Figure 4.8 Elbow method when the "elbow" cannot be identified
Figure 4.9 Elbow method cost as a function of K, with our Twitter data
The 3rd method to determine the number of underlying clusters is information criteria,
specifically Akaike's information criterion (AIC) and the Bayesian information criterion (BIC).
Note that the Elbow method always has a smaller cost for larger K, while an information
criterion overcomes this problem by adding a cost penalty for large K. The smallest criterion
value then corresponds to the optimal K. The information criterion method follows three steps:
Step 1: Fit a Gaussian mixture distribution with K components to the data. To avoid
ill-conditioned covariance estimates (columns of the input 220*36 matrix might be linearly
correlated), we reduced the dimension of the input vector from 36 to 6 by PCA. Recall that the
first 6 principal components capture 94.6% of the total variance, and are enough to
represent the original 36-dim vector.
Step 2: Calculate the likelihood as: Likelihood = Pr(data | Gaussian mixture distribution)
Step 3: Calculate the information criteria as
- AIC = -2 * ln(likelihood) + 2 * k
- BIC = -2 * ln(likelihood) + ln(N) * k
where k is the number of free parameters and N is the number of topics.
Figure 4.10 shows the information criteria as functions of K, and the optimal K = 5. Note that
we restricted K to be no larger than 6, because when K is too large there are ill-conditioned
covariance estimates.
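The three steps can be sketched with scikit-learn's GaussianMixture, whose aic/bic methods implement exactly the formulas above (stand-in data; variable names are mine):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Stand-in for the 220 x 36 percentage-mention matrix.
X = np.vstack([rng.normal(0, 0.05, (110, 36)), rng.normal(1, 0.05, (110, 36))])

# Step 1: reduce to 6 dims by PCA to avoid ill-conditioned covariances.
Z = PCA(n_components=6).fit_transform(X)

# Steps 2-3: fit a K-component Gaussian mixture; aic()/bic() combine the
# log-likelihood with the parameter-count penalty.
aic = [GaussianMixture(k, random_state=0).fit(Z).aic(Z) for k in range(1, 7)]
bic = [GaussianMixture(k, random_state=0).fit(Z).bic(Z) for k in range(1, 7)]
best_k_bic = int(np.argmin(bic)) + 1
```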
Figure 4.10 Information criteria with the first 6 principal components of the mention count vector
The 4th method is the Silhouette method. Its basic idea is to compare within-cluster distances
with between-cluster distances, which is equivalent to penalizing large K, as the
information criteria do. The greater the difference, the better the fit. The silhouette width s(i)
for data point i is defined as

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where
- a(i) is the average distance between i and all other points in the cluster to
which i belongs;
- b(i) is the minimum over the other clusters of the average distance between i and all
the points in that cluster.
The silhouette width ranges from -1 to 1. When s(i) ≈ 0, the point could equally be assigned to
another cluster; when s(i) ≈ -1, the point is misclassified; when s(i) ≈ 1, the
point is well clustered. A clustering can be characterized by the average
silhouette width of the individual points. The largest average silhouette width, over different
K, indicates the best number of clusters. Figure 4.11 shows that the optimal K = 2.
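The same idea with scikit-learn's silhouette_score (synthetic stand-in data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
# Two well-separated groups stand in for the 220 topic vectors.
X = np.vstack([rng.normal(0, 0.05, (110, 36)), rng.normal(1, 0.05, (110, 36))])

# Average silhouette width for each K; the maximum marks the best K.
widths = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    widths[k] = silhouette_score(X, labels)
best_k = max(widths, key=widths.get)
```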
Figure 4.11 Average Silhouette with our Twitter data
The 5th method is the Gap criterion, where the gap value is defined as

Gap value = log(Expected W_K) - log(W_K)

where
- W_K is the within-cluster dissimilarity; and
- Expected W_K is obtained from data uniformly distributed over a rectangle
containing the data.
Then the optimal K is the smallest K such that

Gap(K) ≥ Gap(K+1) - s_{K+1}

where s_K = sd_K * sqrt(1 + 1/B), B is the number of reference data sets (20 in our case),
and sd_K is the standard deviation of log(W_K) over the reference data sets.
Figure 4.12 is an example of applying the Gap criterion to determine the optimal K. Visually,
the left chart shows there are 2 underlying clusters. The chart on the right plots the gap
values generated from the points in the left chart; error bars denote the half-width of s_K.
The optimal K = 2 according to the Gap criterion, which is consistent with the left chart.
Figure 4.13 plots the gap values generated with our Twitter data. According to the Gap criterion,
the optimal K = 2.
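A simplified sketch of the Gap computation (using the total within-cluster sum of squares as W_K and a uniform bounding-box reference; function and variable names are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k, seed=0):
    """Total within-cluster sum of squares for a K-means clustering."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.sum((X - km.cluster_centers_[km.labels_]) ** 2)

def gap(X, k, n_ref=20, seed=0):
    """Gap value: E[log W_K] under a uniform reference minus log W_K."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)   # bounding rectangle
    ref = [
        np.log(within_dispersion(rng.uniform(lo, hi, X.shape), k))
        for _ in range(n_ref)
    ]
    return np.mean(ref) - np.log(within_dispersion(X, k))

rng = np.random.default_rng(6)
# Two tight groups: the gap should jump when K reaches 2.
X = np.vstack([rng.normal(0, 0.05, (60, 2)), rng.normal(1, 0.05, (60, 2))])
gaps = [gap(X, k) for k in (1, 2, 3)]
```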
Figure 4.12 An example of Gap criterion
Figure 4.13 Gap values with our Twitter Data (Input is reduced to 16-dim;
Clustered by Gaussian Mixture Model)
Summarizing all 5 methods (Table 4.7), the underlying number of clusters should
range from 2 to 5, most probably 2.
We also ran K-means on our Twitter data with K ranging from 2 to 5. The number of
topics in each cluster is shown in Table 4.8. To ensure statistical significance,
we need enough topics in each cluster, so the number of clusters should not be too large
(> 3). In the following part of the thesis, we will focus on the results with K = 2.
Table 4.7 Determine number of clusters

Method | Optimal number of clusters
PCA (visually) | Not much insight
Elbow method | 2
Information Criteria (AIC/BIC) | 5
Silhouette method | 2
Gap criterion | 2
Table 4.8 Number of topics in each cluster (k-means)

K (# of clusters) | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
2 | 137 | 83 | | |
3 | 59 | 85 | 76 | |
4 | 85 | 59 | 31 | 45 |
5 | 52 | 28 | 43 | 31 | 66
4.5.2. Clustering results
Recall that the 220 topics are clustered by their vectors of percentage mention count
every 5 days (the 36-dim vectors shown in Table 4.5). Let K = 2 and apply K-means clustering
to the 220 topics; the percentage mention count vectors of the topics in each of the two
clusters are plotted in Figure 4.14. Visually, Cluster 1, compared to Cluster 2, has more
fluctuation before the peak day (day 120). So the total mention count of topics in Cluster 1
might be easier to predict at early stages, based on the information in the fluctuation.
The cluster ID of each topic is shown in Appendix A. Looking more closely at the
characteristics of the topics in each cluster, we see that Cluster 1 contains topics related
to quality issues, new product releases, and periodical topics, such as "Toyota recall," "GM
switch recall," "False beef," "iWatch," and "Australian Open." For such topics, there should be
signals many days before the peak day. Cluster 2, on the contrary, contains topics that are
hard to predict, e.g., "Hurricane Earl," "Thatcher death," "MH370." However, these
characteristics are not strict or rigorous; for example, Cluster 1 also contains "Government
shutdown" and "Typhoon in Philippines," which seem hard to predict, and Cluster 2
contains "NBA Finals," which seems easy to predict.
Figure 4.14 Percentage mention count every 5 days of topics in each cluster (K=2)
4.5.3. Filtering topics by the total mention count before peak day
The objective of clustering and modeling on each cluster is to find out whether there is a
subset of topics with better predictability. Other than clustering, we can also fit the Bass
model on the topics with more mentions before the peak day, which might lead to
better prediction accuracy. In Section 5.4, we will evaluate the Bass model's predictive
accuracy on
- All 220 topics
- Topics in Cluster 1 (137 topics)
- Topics with cumulative mention count ≥ 10 before the peak day (196 topics)
- Topics with cumulative mention count ≥ 1000 before the peak day (158 topics)
4.6. Notations
Here is a summary of the notations in our models.
- x: number of days fed to the model and used for predicting the total mention count,
0 < x ≤ 180.
- intv: time aggregation level; mention count is aggregated to count per hour, count per
day, or count per 5 days, so intv takes three values: 1 hr (1/24 day), 1 day, 5 days.
- Fitting error: mean squared error between the true and fitted cumulative mention count
with 100% of the data (x = 180 days):

fitting error = (1 / (180/intv)) * Σ_{t = intv, 2*intv, ..., 180} (N(t) - N̂(t))^2

- Predicting error: defined as the mean relative error, |(Forecast - True)/True|,
averaged over all 220 topics.
- N(t), N̂(t): true value and estimated value of the cumulative mention count of a certain
topic by time t.
- m, p, q: true values of the Bass model parameters.
- m̂, p̂, q̂: estimates of m, p, and q based on data within the first x days.
5. Results - Predicting Cumulative Mention Count
In this section, we show the predictive accuracy of the machine learning models and the Bass
diffusion model. Recall that the task of our models is to predict the total mention count
of a topic within 180 days, and ideally the predicting error will drop to near
zero before the peak day (day 120). Also recall that the predicting error is defined as the
relative error between the true and forecasted total mention count.
5.1. Machine learning models results
We have four types of models:
- K-nearest-neighbor (KNN) with feature selection
- Linear regression with stepwise feature selection
- Bagged regression tree
- Ensemble model
With variations, there are 8 machine learning models in all (Table 5.1). Details about the
machine learning models can be found in Section 4.3. We feed the models with features
based on the first x days, and learn from the training set the total mention count as a
function of the features.
Table 5.1 The 8 machine learning models

Model ID | Machine Learning Model Template | The way of using features
1 | KNN | Not merged
2 | Weighted KNN | Not merged
3 | KNN | Merged
4 | Weighted KNN | Merged
5 | Linear regression | Not merged
6 | Linear regression | Merged
7 | Bagged tree | Not merged
8 | Bagged tree | Merged
5.1.1. K-nearest-neighbor with feature selection
Figure 5.1 shows the performance of the KNN models, the trend of relative error as a function
of x, and Figure 5.2 shows details after the peak day (day 120). The optimal K for different
x ranges from 1 to 6, and mostly falls in [1, 3]. At the peak day, the relative error increases
greatly, which does not make sense; perhaps the forecasting accuracy is very bad for topics
with small mention counts, resulting in a large percentage error. After the peak day, the
relative error drops greatly to 10% ~ 20%, which makes sense because the non-cumulative
mention count typically drops greatly after the peak day and the cumulative mention count
quickly converges to the total mention count. So when x > 120, the model can learn the total
mention count from merely the first feature f_1(x), the average mention count per interval.
Comparing the 4 KNN variations, different variations outperform at different time stages;
in other words, none of them dominates the others over the whole time horizon. Using
merged features, or weighting neighbors by distance, does not effectively improve predictive
accuracy.
Figure 5.1 Relative error of KNN variations
Figure 5.2 Details after peak day - relative error of KNN variations
5.1.2. Linear regression with feature selection
Figure 5.3 plots the relative error of the linear regression models, and Figure 5.4 shows
details after the peak day. Compared to KNN (relative error mostly between 0 and 100 over
the whole time horizon), the linear regression models have smaller relative error, and the
relative error curve is more stable, without any big fluctuation.
Figure 5.3 Relative error of linear regression models
Figure 5.4 Details after peak day - relative error of linear regression models
5.1.3. Bagged tree
Figure 5.5 plots the relative error of the bagged tree models, and Figure 5.6 shows
details after the peak day. Compared to the KNN and LR models, the tree models have the
smallest relative error over most of the time horizon, and the smallest fluctuation, probably
because of bagging.
Figure 5.5 Relative error of bagged tree models
Figure 5.6 Details after peak day - relative error of bagged tree models
5.1.4. Ensemble model
Different models outperform at different time stages (Table 5.2). To utilize the
advantages of all models, and to mitigate error peaks and fluctuations, we then apply an
ensemble model that takes an average over the KNN average, the LR average, and the Tree
average as its predicted total mention count, where

- KNN average forecast value = average forecast value over all 4 KNN variations
- LR average forecast value = average forecast value over the 2 LR models
- Tree average forecast value = average forecast value over the 2 tree models

Figure 5.9 compares the four types of models at the same scale, and shows that the
ensemble model has smaller relative error than the KNN, LR, or tree models over almost
the whole time horizon.

However, the results of the ensemble model are still not good enough, especially before
the peak day (Day 120). At the peak day, the relative error shows that the forecast value
is still 30 times as much as the true value. This is because the peak day is usually
triggered by news outside Twitter; therefore it is very hard to predict before the peak
day without information from outside Twitter.
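The two-level averaging above can be sketched in a few lines; the forecast numbers below are made up for illustration and are not values from our data:

```python
import numpy as np

def ensemble_forecast(knn_forecasts, lr_forecasts, tree_forecasts):
    """Two-level averaging: average within each model family first,
    then average the three family means."""
    knn_avg = np.mean(knn_forecasts)    # over the 4 KNN variations
    lr_avg = np.mean(lr_forecasts)      # over the 2 LR models
    tree_avg = np.mean(tree_forecasts)  # over the 2 tree models
    return (knn_avg + lr_avg + tree_avg) / 3.0

# Illustrative total-mention-count forecasts for one topic:
pred = ensemble_forecast([1200, 900, 1100, 1000], [950, 1050], [980, 1020])
```

Note that the two-level average weights each family equally, so the four KNN variations together carry no more weight than the two LR models.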
Table 5.2 Best model at each time stage

Days        Best model
0~60        LR
60~90       KNN
90~120      LR & Tree
120~150     KNN
>150        LR
[figure: relative error vs. days, 0-180; peak day marked]
Figure 5.7 Relative error of Ensemble model
[figure: relative error vs. days, 120-165]
Figure 5.8 Details after peak day - relative error of ensemble model
[figure: four panels at the same scale - KNN VARIATIONS (kKNN, wtKNN, KNNmerge, wtKNN merge), LINEAR REGRESSION (LR, LR-merge), TREE MODELS (tree, tree-merge), ENSEMBLE MODEL; relative error vs. days, 0-180]
Figure 5.9 Compare relative error of four types of models
5.2. The Basic Bass model results
In this section, we focus on the predictive accuracy of the Bass model, with two
candidates for the predicted total mention count, and three time aggregation levels.
5.2.1. m̂ or N̂(180) - which one is better in prediction?
In the Bass model, we fitted the model parameters m, p, and q with N(t) - t pairs for
t ≤ x days. The true value of our dependent variable is N(180), the cumulative mention
count within 180 days. In Section 4.4.3 we mentioned that with the fitted parameters
m̂, p̂, q̂, we can predict the cumulative mention count in two ways, N̂(180) or m̂ (= N̂(∞)).

Table 5.3 shows that the fitting error is smaller if we use N̂(180) as the predicted total
mention count. For example, when data are at hourly granularity and 100% of the data are
used to fit the Bass model, the relative error between m̂ and the real total mention count
N(180 days) is 8%, while the relative error between N̂(180 days) and N(180 days) is 2%.

The 2% fitting error in Table 5.3 at the aggregation level of 1hr is relatively small,
showing the fundamental feasibility of applying the Bass model to describe the Twitter
topic diffusion process.

Figure 5.10 shows that no matter what the time aggregation level is, and how many days of
information are used to predict, N̂(180) is always a better prediction than m̂.

One reason to explain the better performance of N̂(180) might be that for some topics,
the cumulative mention count keeps growing and does not converge to a plateau within
the 180-day horizon. Figure 5.11 shows such a topic - #YOLO (you only live once). m̂ is
the total mention count when time goes to infinity and nearly no one mentions the topic
any more, which does not hold for topics like #YOLO.

So in the following part, we will use N̂(180) as the predicted total mention count within
180 days.
Table 5.3 Fitting accuracy (relative error with 100% data) with m̂ or N̂(180)

Time aggregation level    N̂(180)    m̂
1hr                       2.0%       8.0%
1d                        5.4%       1922.9%
5d                        12.6%      8410.1%
[figure: three panels - Aggregated to 1hr, Aggregated to 1 day, Aggregated to 5 days; relative error of N̂(180) vs. m̂ against days known to the model, 0-180]
Figure 5.10 The Bass model predictive accuracy with m̂ or N̂(180)
[figure: cumulative mention count vs. day, 0-180]
Figure 5.11 Cumulative mention count of Topic 8 - #YOLO
5.2.2. Three time aggregation levels
We aggregated the cumulative mention count into 3 aggregation levels - every 1 hr, every
1 day, and every 5 days.
Figure 5.12 and Figure 5.13 plot the predictive accuracy as a function of x. We can see
that the smallest aggregation level (1hr, solid line) gives smaller relative error in
prediction over the whole time horizon.
[figure: relative error (0-800%) vs. first x days; curves: 1hr, 1day, 5day]
Figure 5.12 Predictive accuracy with 3 aggregation levels
[figure: relative error (0-100%) vs. first x days; curves: 1hr, 1day, 5day]
Figure 5.13 Details - predictive accuracy with 3 aggregation levels
If we take a deeper look at the fitted curves with different aggregation levels, at
different time stages, for certain topics (Figure 5.14 and Figure 5.15), we find that a
larger aggregation level in fact fits the cumulative mention count curve better overall.
However, we only care about the cumulative mention count at Day 180, and a smaller
aggregation level fits this point better and therefore leads to smaller prediction error.

So in the following part, we will focus on the aggregation level of 1hr for the Basic
Bass model.
[figure: four panels - Event 89 (iWatch) with 90, 120, 150, and 180 days of data; true curve vs. fitted curves at 1hr, 1d, 5d]
Figure 5.14 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 89 - iWatch)
[figure: four panels - Event 203 (#ISS) with 90, 120, 150, and 180 days of data; true curve vs. fitted curves at 1hr, 1d, 5d]
Figure 5.15 Fitted curve when the Bass model is fed with data in the first 90, 120, 150, 180 days (Topic 203 - #ISS)
5.3. Machine learning vs. the Bass model
Figure 5.16 and Figure 5.17 show the predictive accuracy of the machine learning ensemble
model and the Basic Bass model with an aggregation level of 1 hr. The Bass model has
smaller relative error over almost the whole time horizon.

Using only the mention count, the Bass model is no worse than machine learning models
with extra features (e.g. % root tweet) over the whole time horizon. It means the Basic
Bass model partially captures the underlying mechanism of the Twitter topic development
process. The Twitter topic development process is fundamentally similar to new product
diffusion, but with a very large p (impact from out-of-network), and is therefore hard to
predict before the peak day (the 120th day).

Note that although the Bass model is much better in prediction compared to machine
learning models, it still cannot predict well before the peak day without information
outside Twitter, e.g. the news triggering the Twitter topic explosion.

In addition, recall that in Section 2.3.3 we discussed the intuition that the prediction
power of the Bass model is limited, especially when we fit the curve with only points
before the jump of the S-curve (Day 120), because different sigmoid functions look
similar at an early stage (Figure 5.19). Our results on Twitter data confirmed this
intuition. Figure 5.18 shows that even with the best aggregation level (1hr), the
prediction error is still large before the peak day, and especially before Day 30.
[figure: relative error vs. day, 0-180]
Figure 5.16 Machine learning vs. Basic Bass
[figure: relative error vs. day, 0-180]
Figure 5.17 Details - Machine learning vs. Basic Bass
[figure: relative error (0-140%) vs. first x days]
Figure 5.18 Predictive accuracy of the Bass model (aggregation level is 1hr)
[figure: Sigmoid 1 = 1/(1+exp(-2x+10)), Sigmoid 2 = 2/(1+exp(-2x+10)), Sigmoid 3 = 1.5/(1+exp(-2x+14)), plotted for x from 0 to 10]
Figure 5.19 Different sigmoid functions look similar at early stage
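The three curves in the legend of Figure 5.19 can be evaluated directly to make this concrete; a small sketch (the probe points x = 2 and x = 10 are chosen for illustration):

```python
import math

def sigmoid(x, scale, rate, shift):
    """Generalized logistic curve: scale / (1 + exp(-rate*x + shift))."""
    return scale / (1.0 + math.exp(-rate * x + shift))

# The three curves plotted in Figure 5.19:
s1 = lambda x: sigmoid(x, 1.0, 2.0, 10.0)   # 1/(1+exp(-2x+10))
s2 = lambda x: sigmoid(x, 2.0, 2.0, 10.0)   # 2/(1+exp(-2x+10))
s3 = lambda x: sigmoid(x, 1.5, 2.0, 14.0)   # 1.5/(1+exp(-2x+14))

# Early on (x = 2) the three are nearly indistinguishable,
early_spread = max(s(2.0) for s in (s1, s2, s3)) - min(s(2.0) for s in (s1, s2, s3))
# but their plateaus (x = 10) differ by up to a factor of two.
late_spread = max(s(10.0) for s in (s1, s2, s3)) - min(s(10.0) for s in (s1, s2, s3))
```

The small early spread versus the large late spread is exactly why a fit using only pre-peak points pins down the total count so poorly.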
5.4. Modeling on a subset of topics
In this section, we will model on

- All 220 topics
- Topics with cumulative mention count >= 10 before peak (196 topics)
- Topics with cumulative mention count >= 1000 before peak (158 topics)
- Topics in Cluster 1 (137 topics)

and see if there exists a subset of topics with better predictability.
5.4.1. Machine learning on Cluster 1 only
Figure 5.20 recalls the clustering result on the 220 topics with K-means when K = 2.
Topics in Cluster 1 have mention count curves with more fluctuations before the peak day,
and therefore should provide more information for our models to predict. And Table 5.4
shows that the number of topics and the topic total mention count do not differ
significantly between Cluster 1 and Cluster 2.

Figure 5.21 and Figure 5.22 compare the predictive accuracy of the machine learning
ensemble model on all 220 topics vs. on Cluster 1 only. From Day 60 to Day 120, the
prediction error on Cluster 1 is much smaller than on all 220 topics. It means that the
mention count fluctuation before the peak day does offer more information and makes it
easier to predict the total mention count.
[figure: two panels - 2clusters #1 and #2; percentage mention count vs. day, 0-150]
Figure 5.20 Percentage mention count every 5 days of topics in each cluster (K=2)
Table 5.4 Compare topics in two clusters

Cluster ID    Number of topics    Average total mention count
#1            137                 1.675*10^6
#2            83                  0.506*10^6
[figure: relative error vs. days known to the model, 0-180; curves: cluster 1 only, all 220 topics]
Figure 5.21 Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only
[figure: relative error vs. days known to the model, 120-180; curves: cluster 1 only, all 220 topics]
Figure 5.22 Details after the peak day - Predictive accuracy of machine learning ensemble model - All 220 topics vs Cluster 1 only
5.4.2. The Bass model on a subset of topics
From Table 5.5 and Figure 5.23, we know that at the aggregation level of 1hr, compared to
the average relative error over all 220 topics, certain subsets of topics have smaller
prediction error over most of the time horizon. For example, applying the Bass model to
topics with mention count before peak >= 1000, the prediction error after Day 30 can be
reduced by 1% ~ 12% compared to the average prediction error over all 220 topics.
Table 5.5 Relative error of the Bass model on a subset of topics

Days known      Baseline:         count before    count before    cluster 1
to the model    all 220 topics    peak >=10       peak >=1000     only
15              150%              95%             91%             163%
60              145%              96%             92%             91%
[the rows for days 30, 45, 75, 90, 100, 110, 120, 130, 140, 150, 165, and 180 could not
be aligned from this copy of the scan; the recovered values fall from roughly 90-170%
before the peak day to 1.2%-2.0% at Day 180]
[figure: error reduction (-10% to 15%) vs. days known to the model, 0-180; curves: peak count >=10 (196 events), cluster 1 only]
Figure 5.23 How much relative error of the Bass model can be reduced by using a subset of topics (Data are aggregated to 1hr)
However, even focusing on the subset of topics with better predictability, the prediction
error before the peak day is still above 60%, which is too large. It again shows the
importance of information outside Twitter.
6. Discussion on the Parameter Estimation Methods of the Bass Model
The Basic Bass model has three parameters m, p, and q. The parameter estimation procedure
is essential in this study, since it directly determines the predictive accuracy. In this
chapter, we will review existing methods of estimating the Bass model parameters, and
then evaluate two of the methods with our Twitter data.
6.1. Review of existing methods to estimate the Bass model
parameters
Recall that the key equation of the Bass model implies that the number of new adopters
within (t, t + dt) is determined by the number of adopters N(t), the market potential
(total possible adopters) m = N(∞), the innovation coefficient p, and the imitation
coefficient q:

n(t) = dN(t)/dt = p[m - N(t)] + (q/m) N(t)[m - N(t)]

Solving the above differential equation with boundary condition N(0) = 0, we get the
expression for the cumulative number of adopters

N(t) = m (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t))

Other variables, including the non-cumulative number of adopters n(t), the time of peak
adoption t*, and the number of adopters at the peak time n(t*), can be derived as follows:

n(t) = dN(t)/dt = m p(p + q)^2 e^(-(p+q)t) / [p + q e^(-(p+q)t)]^2

t* = (1/(p + q)) ln(q/p)

n(t*) = m (p + q)^2 / (4q)
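These closed-form expressions are easy to check numerically; a minimal sketch (the parameter values m = 1000, p = 0.03, q = 0.38 are arbitrary illustrations, not fitted values from our data):

```python
import math

def bass_N(t, m, p, q):
    """Cumulative adopters N(t) of the Basic Bass model."""
    e = math.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

def bass_n(t, m, p, q):
    """Non-cumulative adoption rate n(t) = dN(t)/dt."""
    e = math.exp(-(p + q) * t)
    return m * p * (p + q) ** 2 * e / (p + q * e) ** 2

m, p, q = 1000.0, 0.03, 0.38
t_star = math.log(q / p) / (p + q)        # time of peak adoption
n_peak = m * (p + q) ** 2 / (4.0 * q)     # adoption rate at the peak
```

Evaluating bass_n at t_star reproduces n_peak, and bass_N approaches m for large t, as the formulas above require.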
6.1.1. Evaluate estimation methods by fitting error and one-step-ahead
predicting error
Mahajan et al. (1985) compared four procedures for estimating m, p, and q in the Basic
Bass model when t is continuous: ordinary least squares (OLS), maximum likelihood
estimation (MLE), nonlinear least squares (NLS), and algebraic estimation (AE). Their
results showed that NLS is better than the other three in terms of both fitting accuracy
and one-step-ahead predictive accuracy.
6.1.1.1. Ordinary Least Squares Estimation (OLS)

OLS was originally suggested by Bass (1969). This method assumes the time intervals are
equal to unity (e.g. yearly data), and applies OLS to fit the discretized key equation of
the Bass model

n(k) = p[m - N(k - 1)] + (q/m) N(k - 1)[m - N(k - 1)]
     = a·N(k - 1)^2 + b·N(k - 1) + c

Apply OLS to get the optimal a, b, and c (minimizing the MSE of n(k)), and then solve for
the optimal m, p, and q.
The coefficients satisfy a = -q/m, b = -p + q, and c = pm, so a·m^2 + b·m + c = 0, and

m = (-b ± sqrt(b^2 - 4ac)) / (2a),  p = c/m,  q = -a·m

There are two sets of solutions to this equation system:

m₁ = (-b + sqrt(b^2 - 4ac)) / (2a), p₁ = c/m₁, q₁ = -a·m₁

or

m₂ = (-b - sqrt(b^2 - 4ac)) / (2a), p₂ = c/m₂, q₂ = -a·m₂

Theoretically, m, p, and q are all positive, and we assume p < q to ensure that the peak
time is > 0; therefore a = -q/m < 0, b = -p + q > 0, and c = pm > 0. So
sqrt(b^2 - 4ac) > b, and m₁ < 0, which means the first solution is counter-intuitive;
therefore we use the second solution.
Advantages of OLS are:
- It is easy to implement.
- OLS is applicable to many diffusion models other than the Bass model.

Disadvantages are:
- In the presence of only a few data points, N(k - 1)^2 and N(k - 1) are likely to be
multicollinear, and parameter estimates are unstable or possess wrong signs.
- Since m, p, and q are nonlinear functions of a, b, and c, the standard errors of m, p,
and q are not available.
- There is a time-interval bias, since n(k) will overestimate the derivative of N(t)
taken at t = k - 1 before the peak time and underestimate it after the peak time.
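A minimal sketch of this OLS procedure on data simulated from known parameters (the parameter values and series length are illustrative; this is not the estimation code used in the thesis):

```python
import numpy as np

def bass_N(t, m, p, q):
    """Closed-form cumulative Bass curve, used here only to simulate data."""
    e = np.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

def ols_bass(N):
    """Estimate (m, p, q) from a cumulative series N(0), N(1), ...

    Fits n(k) = a*N(k-1)^2 + b*N(k-1) + c by least squares, then takes
    the second (positive-m) root of a*m^2 + b*m + c = 0.
    """
    n = np.diff(N)                  # per-period adoptions n(k)
    x = N[:-1]                      # N(k-1)
    a, b, c = np.polyfit(x, n, 2)   # quadratic in N(k-1)
    m = (-b - np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return m, c / m, -a * m         # (m, p, q)

# Simulate noiseless "yearly" data and recover the parameters; the estimates
# are close but not exact, because of the time-interval bias noted above.
t = np.arange(0, 151)
m_hat, p_hat, q_hat = ols_bass(bass_N(t, 1000.0, 0.01, 0.1))
```

Even on noiseless data the recovered parameters deviate slightly from the generating ones, which illustrates the time-interval bias listed among the disadvantages.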
6.1.1.2. Maximum Likelihood Estimation (MLE)

Schmittlein and Mahajan (1982) suggested an MLE procedure. Let f(t) denote the
probability of adoption at time t, and let the C.D.F. be F(t) = ∫₀ᵗ f(τ) dτ. Then

f(t) = n(t)/m,  F(t) = N(t)/m = (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t))

Let M denote the total population; then m is the part of the population that will finally
adopt as time goes to infinity. Let c = m/M. F(t) can be treated as the cumulative
adoption probability at time t, conditional on the whole population adopting eventually.
Then the unconditional C.D.F. is

G(t) = c F(t) = c (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t))

Then the likelihood function for the observed histogram up to the T time intervals for
which data are available can be written as

L(T) = [1 - G(T)]^(M - N(T)) · ∏_{i=1}^{T} [G(i) - G(i - 1)]^(n(i))
Advantages of MLE include:
- The time-interval bias is eliminated, since MLE uses an appropriate aggregation of the
continuous-time model over the time intervals represented by the data.
- Schmittlein and Mahajan (1982) provided formulae for approximate standard errors of
p, q, and m to examine the stability of the estimated parameters.

Disadvantages are:
- Estimation of m and the development of predictions require prior knowledge of M.
- The approximated standard error considers only sampling error and ignores all other
sources of error.
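The likelihood above is straightforward to evaluate once G(t) is in hand; a sketch with made-up adoption counts (M, p, q, and c below are illustrative assumptions, not estimates):

```python
import math

def bass_F(t, p, q):
    """Conditional Bass C.D.F. F(t)."""
    e = math.exp(-(p + q) * t)
    return (1.0 - e) / (1.0 + (q / p) * e)

def log_likelihood(n_counts, M, p, q, c):
    """log L(T) for per-interval adoption counts n(1), ..., n(T)
    out of a total population M, with c = m / M."""
    T = len(n_counts)
    G = [c * bass_F(i, p, q) for i in range(T + 1)]   # G(0) = 0
    N_T = sum(n_counts)
    ll = (M - N_T) * math.log(1.0 - G[T])             # never-adopters term
    for i in range(1, T + 1):
        ll += n_counts[i - 1] * math.log(G[i] - G[i - 1])
    return ll

# Five periods of illustrative adoption counts out of a population of 1000:
ll = log_likelihood([10, 30, 60, 50, 20], M=1000, p=0.02, q=0.4, c=0.5)
```

In practice this function would be handed to a numerical optimizer over (p, q, c), with M known in advance, which is exactly the prior-knowledge requirement noted above.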
6.1.1.3. Nonlinear Least Squares Estimation (NLS)

NLS was designed to overcome the shortcomings of MLE. There are three variations of NLS,
suggested by Srinivasan and Mason (1985), Jain and Rao (1985), and Mahajan et al. (1985),
respectively fitting the following three nonlinear expressions to get m, p, and q, where
F(t) = (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t)):

(1) n(t_k) = N(t_k) - N(t_{k-1}) = m·F(t_k) - m·F(t_{k-1})

(2) n(t_k) = [m - N(t_{k-1})] · [F(t_k) - F(t_{k-1})] / [1 - F(t_{k-1})]

(3) N(t_k) = m·F(t_k)

Mahajan et al. (1985) suggested that NLS (2) is the best in terms of fitting error and
one-step-ahead predicting error.
Advantages of NLS are:
- It eliminates the time-interval bias.
- The procedure accounts for all sources of error, thereby providing valid standard
error estimates.

Disadvantages are:
- It is slow to converge, or may not converge.
- It is sensitive to starting values, and might converge to a local minimum; starting
values need to be obtained by other parameter estimation methods.
6.1.1.4. Algebraic Estimation (AE)

AE was proposed by Mahajan and Sharma (1985), and was used to provide a rough estimate or
good starting values for MLE and NLS. It requires knowledge of the peak time t*, n(t*),
and N(t*). Substituting

n(t*) = (m/(4q))(p + q)^2

N(t*) = m/2 - mp/(2q)

into t* = (1/(p + q)) ln(q/p) gives

t* = [(m - N(t*)) / (2n(t*))] · ln[m / (m - 2N(t*))]

which is used to find m numerically. Once m is known,

p = n(t*)(m - 2N(t*)) / (m - N(t*))^2

q = n(t*)·m / (m - N(t*))^2

and p and q are known.

Advantages:
- Conceptually and computationally easy
- Can be used to provide good starting values for MLE and NLS
Disadvantages:
- Does not provide standard errors for the estimated parameters
- Has a time-interval bias
- Not applicable if n(t) has not yet peaked
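The AE procedure boils down to one-dimensional root finding in m; a sketch using plain bisection (the peak-point inputs in the example are computed from illustrative parameters m = 1000, p = 0.03, q = 0.38, so the known answer can be recovered):

```python
import math

def algebraic_estimate(t_star, n_star, N_star, m_hi=1e9):
    """Recover (m, p, q) from the peak time t*, peak rate n(t*), and N(t*)."""
    def g(m):
        # t* implied by this m, minus the observed t*; decreasing in m
        return (m - N_star) / (2.0 * n_star) * math.log(m / (m - 2.0 * N_star)) - t_star

    lo, hi = 2.0 * N_star * (1.0 + 1e-9), m_hi   # the formula needs m > 2 N(t*)
    for _ in range(200):                          # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    m = 0.5 * (lo + hi)
    p = n_star * (m - 2.0 * N_star) / (m - N_star) ** 2
    q = n_star * m / (m - N_star) ** 2
    return m, p, q

# Peak-point values implied by m = 1000, p = 0.03, q = 0.38:
t_s = math.log(0.38 / 0.03) / 0.41
n_s = 1000.0 * 0.41 ** 2 / (4.0 * 0.38)
N_s = 1000.0 * (0.38 - 0.03) / (2.0 * 0.38)
m_est, p_est, q_est = algebraic_estimate(t_s, n_s, N_s)
```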
6.1.2. Discrete Bass model
In reality, time t is usually discretized. Some of the above methods can be accommodated
to the situation with discrete time, such as MLE, NLS, and AE.
However, when OLS estimates m, p, and q, it does not require any information about time
(the scale of the unit), and therefore applying the estimated m, p, and q to the
expression for N(t)

N(t) = m (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t))

is mathematically questionable.
Satoh (2001) provided an explicit formula for N(k) in the discrete Bass model. It is
based on the Riccati equation (Hirota 1979)

du/dt = a + 2bu + cu^2

The parameters a, b, and c can even be functions of time a(t), b(t), c(t), but in the
case of the discrete Bass model they are all constants.

A discrete version of the Riccati equation is

[u(t + δ) - u(t - δ)] / (2δ) = a + b[u(t + δ) + u(t - δ)] + c·u(t + δ)u(t - δ)

where δ is the constant time interval. The solution to the discrete Riccati equation is

u(t) = [C₊ + C₋ exp(β(t - t₀))] / [1 + exp(β(t - t₀))]

where C₊ and C₋ are the two roots of cu^2 + 2bu + a = 0, i.e.

C± = (-b ± sqrt(b^2 - ac)) / c

and β satisfies tanh(βδ) = 2δ·sqrt(b^2 - ac).
Then the discrete Bass model can be derived from the continuous Bass model and the
Riccati equation, with a = pm, b = (q - p)/2, and c = -q/m, as

(N_{n+1} - N_{n-1}) / (2δ) = pm + ((q - p)/2)(N_{n+1} + N_{n-1}) - (q/m) N_{n+1} N_{n-1}

And the solution to the discrete Bass model is

N_n = m (1 - ρ^n) / (1 + (q/p) ρ^n),  ρ = [(1 - δ(p + q)) / (1 + δ(p + q))]^(1/2)

where n = t/δ. Note that the data have to be collected periodically because the time
interval δ is a constant. And when δ -> 0,

ρ^n = [(1 - δ(p + q)) / (1 + δ(p + q))]^(t/(2δ)) -> e^(-(p+q)t)

so N_n converges to the continuous solution
N(t) = m (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t)).
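The convergence of the discrete solution to the continuous one can be checked numerically. The sketch below assumes the discrete solution takes the form N_n = m(1 - ρ^n)/(1 + (q/p)ρ^n) with ρ = [(1 - δ(p+q))/(1 + δ(p+q))]^(1/2), and uses illustrative parameter values:

```python
import math

def bass_N_cont(t, m, p, q):
    """Continuous Bass solution N(t)."""
    e = math.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

def bass_N_disc(t, m, p, q, delta):
    """Discrete Bass solution evaluated at t = n * delta."""
    n = round(t / delta)
    rho = math.sqrt((1.0 - delta * (p + q)) / (1.0 + delta * (p + q)))
    return m * (1.0 - rho ** n) / (1.0 + (q / p) * rho ** n)

m, p, q = 1000.0, 0.03, 0.38
target = bass_N_cont(10.0, m, p, q)
coarse_gap = abs(bass_N_disc(10.0, m, p, q, delta=1.0) - target)
fine_gap = abs(bass_N_disc(10.0, m, p, q, delta=0.01) - target)
```

Shrinking δ shrinks the gap to the continuous curve, consistent with the δ -> 0 limit.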
6.2. Parameter estimation procedure in this thesis
In our research with Twitter data, we applied NLS, the best method according to Mahajan
et al. (1985). Although Mahajan suggested that fitting n(t_k) results in smaller fitting
error and smaller predicting error than fitting N(t_k), we still fit N(t_k), because the
metric we care about is different from the metric Mahajan used: we care about the
cumulative mention count N(t_k) more than the non-cumulative mention count n(t_k), while
Mahajan predicted n(t_k).

So in this thesis, we fit the N(t) - t curve to find the optimal m, p, and q. For
example, when we have data in the first x days and we want to predict N(180), our fitting
cost function is

fitting error = (1/K) Σ_t (N̂(t) - N(t))^2

where the sum runs over the K observed points t = intv, 2·intv, ..., x days. Then, with
the fitted m̂, p̂, and q̂ and the formula for N(t), we obtain N̂(180) as a prediction of
N(180).
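As a crude stand-in for the NLS solver, note that for fixed (p, q) the cumulative model N(t) = m·F(t; p, q) is linear in m, so the optimal m has a closed form and a grid search over (p, q) completes the fit. The sketch below recovers known generating parameters from noiseless data (the grid, horizon, and parameter values are illustrative, and this is not the solver used in the thesis):

```python
import numpy as np

def bass_F(t, p, q):
    """Conditional Bass C.D.F. F(t)."""
    e = np.exp(-(p + q) * t)
    return (1.0 - e) / (1.0 + (q / p) * e)

def nls_bass(t, N_obs, p_grid, q_grid):
    """Grid-search least-squares fit of N(t) = m * F(t; p, q)."""
    best = None
    for p in p_grid:
        for q in q_grid:
            F = bass_F(t, p, q)
            m = float(F @ N_obs) / float(F @ F)          # optimal m given (p, q)
            sse = float(np.sum((N_obs - m * F) ** 2))    # fitting error
            if best is None or sse < best[0]:
                best = (sse, m, p, q)
    return best[1:]

# Noiseless data over 180 days from known parameters:
t = np.arange(1, 181, dtype=float)
N_obs = 1000.0 * bass_F(t, 0.002, 0.05)
grid = np.linspace(0.001, 0.1, 100)
m_hat, p_hat, q_hat = nls_bass(t, N_obs, grid, grid)
```

A real NLS routine (e.g. a Levenberg-Marquardt solver) replaces the grid search, but as noted above it is sensitive to starting values, which is why AE or OLS estimates are often used to initialize it.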
6.2.1. Evaluate OLS and NLS with Twitter data
With Twitter data at the aggregation level of 1hr, we applied both OLS and NLS to predict
the total mention count N(180); the results are shown in Figure 6.1. NLS outperforms OLS
over almost the whole time horizon, which is consistent with the conclusion in Mahajan et
al. (1985).
[figure: relative error (0-200%) vs. days known to the model, 0-180]
Figure 6.1 Predicting error - OLS (dashed line) vs NLS (solid line)
7. Conclusions and Summary
Much work has been done on applying diffusion models in social network analysis since
Twitter introduced hashtags in 2010. Most of it tried to predict the user amount of a
whole social network website, such as Wang (2011) and Bauckhage et al. (2014). A few
focused on a smaller granularity - hashtag popularity: e.g., Yang and Leskovec (2010)
predicted it with a linear influence model, and Ma et al. (2013) predicted it by
classification. However, to our knowledge, no one has applied the Bass model to predict
topic popularity, where a topic is specified by keywords or a hashtag.

We applied machine learning regression models and the Basic Bass model to predict the
total mention count of each Twitter topic. Different machine learning models - KNN,
linear regression, bagged trees - outperform at different time stages. An ensemble model
takes advantage of all three models, and leads to better prediction over almost the whole
time horizon, especially before the peak day.

Our results also reveal the fundamental feasibility of applying the Bass model to
describe the Twitter topic diffusion process. Compared to machine learning models, the
Bass model dramatically reduces the prediction error before the peak day (Day 120). Using
only the mention count, over the whole time horizon, the Bass model is no worse than
machine learning models with extra features (e.g. % root tweet). It means the Basic Bass
model partially captures the underlying mechanism of the Twitter topic development
process, and we can draw an analogy between a Twitter topic's adoption process and a new
product's diffusion process.

There exists a subset of topics with a mention count pattern that is easier to predict
than others. These topics usually have a larger mention count, or more fluctuation in the
mention count curve before the peak day, and therefore offer more information to the
model for predicting the total mention count at an early time stage.

However, even focusing on the subset of topics with better predictability, and applying
the Bass model (better than the machine learning ensemble model), the predictive accuracy
is still not good enough, especially before the peak day (Day 120). This is because the
peak day is usually triggered by news outside Twitter; therefore it is very hard to
predict before the peak day without information from outside Twitter. Another reason for
the prediction error at the early stage is that the Bass model assumes the cumulative
mention count follows a sigmoid function, and different sigmoid functions are very
similar at an early stage.
8. Future Work
In the future, we could first try other possible models to predict hashtag popularity.
For example, the Linear Influence Model (Yang and Leskovec 2010) is a choice other than
Bass for modeling the influence of those who have already tweeted on those who have not,
and for describing the diffusion process. The Logistic Growth Model (Chang and Wang
2011), the Weibull model, and the Gompertz model (Bauckhage et al. 2014) can also be
applied to describe the diffusion process.

In our case, since the predicted mention count will later be used to predict the demand
for a new product, classification might not be accurate enough. Or, if we can tolerate
the inaccuracy of classification compared to regression, then we could use classification
in our future research, which would be easier than regression.

The Generalized Bass model is also a choice, as an extension of our current Basic Bass
model. It assumes the diffusion rate is affected by a real-time marketing influence
denoted by x(t); x(t) can be price, advertising power, or any other factor at time t. The
non-cumulative mention count can then be written as:

n(t) = (p[m - N(t)] + (q/m) N(t)[m - N(t)]) · x(t)
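A minimal forward-Euler simulation of this model, with an illustrative marketing function x(t) (all parameter values are made up):

```python
def simulate_generalized_bass(m, p, q, x, T, dt=0.01):
    """Forward-Euler integration of
    dN/dt = (p*(m - N) + (q/m)*N*(m - N)) * x(t)."""
    N = 0.0
    for k in range(int(T / dt)):
        t = k * dt
        N += (p * (m - N) + (q / m) * N * (m - N)) * x(t) * dt
    return N

m, p, q = 1000.0, 0.01, 0.1
# With x(t) = 1 the model reduces to the Basic Bass model ...
N_basic = simulate_generalized_bass(m, p, q, lambda t: 1.0, T=60.0)
# ... while a marketing push (x(t) = 2 for the first 10 days) speeds diffusion.
N_boost = simulate_generalized_bass(m, p, q, lambda t: 2.0 if t < 10.0 else 1.0, T=60.0)
```

With x(t) = 1 the simulated trajectory matches the Basic Bass model, and any x(t) > 1 accelerates adoption without changing the ceiling m.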
Second, there are other features that could be included in our machine learning models,
such as some features used by Ma et al. (2013) - "Binary attribute checking whether a
hashtag contains digits" and "Number of segment words from a hashtag."

Third, there are other available data sources, such as news websites and blogs (Yang and
Leskovec 2010). Sources outside Twitter are essential in our research, because a large
percentage of trends are exogenous, i.e., originate from outside of Twitter (Section
2.4.1; e.g., earthquakes, sports games). If we had knowledge of the triggering news
outside Twitter, it would be very helpful for improving the predicting power of our
models.

Fourth, there might be a way to incorporate the Bass diffusion model into machine
learning models. During this study, we planned to predict m, p, and q based on features:
we can treat m, p, and q as functions of features (e.g. % of root tweets), and learn the
functions from the training set. However, in the test set we would predict m, p, and q
based solely on the features, not on the hourly mention count. In this way, we are not
utilizing the underlying diffusion structure captured by the Bass model, and therefore
not taking advantage of the Bass model over machine learning models, so the prediction
accuracy might be similar to that of the pure machine learning models we built before. We
hope to find a good way to combine the Bass model and machine learning in the future.

Finally, we could build a bridge between the Twitter mention count for a certain product
and the demand for this product, and thereby make our models financially meaningful to
companies.
BIBLIOGRAPHY
1) Agarwal, Deepak, Bee-Chung Chen, and Pradheep Elango. "Spatio-temporal models
for estimating click-through rate." Proceedings of the 18th international conference
on World wide web. ACM, 2009.
2) Archak, Nikolay, Anindya Ghose, and Panagiotis G. Ipeirotis. "Deriving the pricing
power of product features by mining consumer reviews." Management Science 57.8
(2011): 1485-1509.
3) Bass, Frank. "A new product growth for model consumer durables." Management
Science 15.5 (1969): 215-227.
4) Bass, Frank M., Trichy V. Krishnan, and Dipak C. Jain. "Why the Bass model fits without
decision variables." Marketing science 13.3 (1994): 203-223.
5) Bauckhage, Christian, Kristian Kersting, and Bashir Rastegarpanah. "Collective
attention to social media evolves according to diffusion models." Proceedings of the
companion publication of the 23rd international conference on World wide web
companion. International World Wide Web Conferences Steering Committee, 2014.
6) Çelen, Boğaçhan, and Shachar Kariv. "Observational learning under imperfect
information." Games and Economic Behavior 47.1 (2004): 72-86.
7) Çelen, Boğaçhan, and Shachar Kariv. "An experimental test of observational learning
under imperfect information." Economic Theory 26.3 (2005): 677-699.
8) Çelen, Boğaçhan, Shachar Kariv, and Andrew Schotter. "An experimental test of advice
and social learning." Management Science 56.10 (2010): 1687-1701.
9) Chang, Hsia-Ching. "A new perspective on Twitter hashtag use: Diffusion of innovation
theory." Proceedings of the American Society for Information Science and
Technology 47.1 (2010): 1-4.
10) Chen, Yubo, and Jinhong Xie. "Online consumer review: Word of mouth as a new
element of marketing communication mix." Management Science 54.3 (2008): 477-491.
11) Guadagno, Rosanna E., et al. "What makes a video go viral? An analysis of emotional
contagion and Internet memes." Computers in Human Behavior 29.6 (2013): 2312-2319.
12) Hirota, Ryogo. "Nonlinear partial difference equations. V. Nonlinear equations
reducible to linear equations." Journal of the Physical Society of Japan 46.1 (1979):
312-319.
13) Jain, Dipak C., and Ram C. Rao. "Effect of price on the demand for durables: Modeling,
estimation, and findings." Journal of Business & Economic Statistics 8.2 (1990): 163-170.
14) Ma, Zongyang, Aixin Sun, and Gao Cong. "On predicting the popularity of newly
emerging hashtags in Twitter." Journal of the American Society for Information Science
and Technology 64.7 (2013): 1399-1410.
15) Mahajan, V., C.H. Mason and V. Srinivasan. "An evaluation of estimation procedures
for new product diffusion models." In V. Mahajan and Y. Wind (eds.): Innovation
Diffusion Models of New Product Acceptance (Ballinger, Cambridge, Massachusetts,
1986), 203-232.
16) Mahajan, Vijay, and Subhash Sharma. "A simple algebraic estimation procedure for
innovation diffusion models of new product acceptance."Technological Forecasting
and Social Change 30.4 (1986): 331-345.
17) Naaman, Mor, Hila Becker, and Luis Gravano. "Hip and trendy: Characterizing
emerging trends on Twitter." Journal of the American Society for Information Science
and Technology 62.5 (2011): 902-918.
18) Netzer, Oded, et al. "Mine your own business: Market-structure surveillance through
text mining." Marketing Science 31.3 (2012): 521-543.
19) Robinson, Bruce, and Chet Lakhani. "Dynamic price models for new-product
planning." Management science 21.10 (1975): 1113-1122.
20) Rogers, E.M. Diffusion of Innovations, 5th ed. New York: Free Press (2003).
21) Satoh, Daisuke. "A discrete Bass model and its parameter estimation." Journal of the
Operations Research Society of Japan-Keiei Kagaku 44.1 (2001): 1-18.
22) Schmittlein, David C., and Vijay Mahajan. "Maximum likelihood estimation for an
innovation diffusion model of new product acceptance." Marketing Science 1.1 (1982):
57-78.
23) Schweidel, David A., and Wendy W. Moe. "Listening In on Social Media: A Joint Model
of Sentiment and Venue Format Choice." Journal of Marketing Research 51.4 (2014):
387-402.
24) Srinivasan, V., and Charlotte H. Mason. "Technical Note-Nonlinear Least Squares
Estimation of New Product Diffusion Models." Marketing science 5.2 (1986): 169-178.
25) Stouffer, Daniel B., R. Dean Malmgren, and Luis AN Amaral. "Log-normal statistics in
e-mail communication patterns." arXiv preprint physics/0605027(2006).
26) Szabo, Gabor, and Bernardo A. Huberman. "Predicting the popularity of online
content." Communications of the ACM 53.8 (2010): 80-88.
27) Tirunillai, Seshadri, and Gerard J. Tellis. "Does chatter really matter? Dynamics of user-generated content and stock performance." Marketing Science 31.2 (2012): 198-215.
28) Ulrich, Rolf, and Jeff Miller. "Information processing models generating lognormally
distributed reaction times." Journal of Mathematical Psychology 37.4 (1993): 513-525.
29) Van Breukelen, Gerard JP. "Psychometric and information processing properties of
selected response time models." Psychometrika 60.1 (1995): 95-113.
30) Wallsten, Kevin. ""Yes we can": How online viewership, blog discussion, campaign
statements, and mainstream media coverage produced a viral video
phenomenon." Journal of Information Technology & Politics 7.2-3 (2010): 163-181.
31) Wang, Chen-Ya. "A PRELIMINARY FORECASTING WITH DIFFUSION MODELS: TWITTER
ADOPTION AND HASHTAGS DIFFUSION." (2011)
32) Yang, Jaewon, and Jure Leskovec. "Modeling information diffusion in implicit
networks." Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE,
2010.
33) Zaman, Tauhid, Emily B. Fox, and Eric T. Bradlow. "A Bayesian approach for predicting
the popularity of tweets." The Annals of Applied Statistics 8.3 (2014): 1583-1611.
Appendix A The List of Topics Used in this Research

Topic ID  Topic Name (denoted by a hashtag or keywords)  Peak Time     Total mention count  Cluster ID
1         2Chainz                      07-May-2012   1250460    1
2         #AintNobodyGotTimeForThat    1?-Jul-2012   294508     2
3         #ratchet                     07-Nov-2012   483155     1
4         #ChiefKeefMakesMusicFor      17-Sep-2012   7873       ?
5         #coolstorybro                18-Mar-2012   66080      ?
6         #Struggle                    07-Nov-2012   120256     ?
7         #TurntUp                     ??-Oct-2012   331318     2
8         #yolo                        2?-Mar-2012   5190369    ?
9         #ThatShitlDontLike           0?-Sep-2012   66505      ?
10        #hirihanna                   1?-Oct-2011   11601      2
11        #ModernSeinfeld              12-Dec-2012   771        1
12        #DrakesMusicWillHaveYou      17-Dec-2012   55363      2
13        #endoftheworldconfessions    21-Dec-2012   581195     2
14        #2012regrets                 27-Dec-2012   140775     2
15        #MSL                         0?-Aug-2012   383626     2
16        #EDL                         01-Sep-2012   117272     1
17        #Sandy                       ??-Oct-2012   4600864    1
18        #ISS                         ??-Oct-2012   104387     1
19        #Synchro                     05-Aug-2012   4324       1
20        #Swimming                    29-Jul-2012   265618     1
21        #Endeavour                   2?-Sep-2012   53826      1
22        #spottheshuttle              3?-Sep-2012   73274      2
23        #austin                      1?-Nov-2012   220137     1
24        #Bengals                     ??-Jan-2012   141095     1
25        #WhoDey                      ??-Jan-2012   62526      1
26        #pandaAl                     ??-Apr-2012   1167       2
27        #askneil                     24-Oct-2012   5726       2
28        #bondtweets                  23-Oct-2012   654        2
29        #london2012                  27-Jul-2012   5212299    1
30        #oneweb                      2?-Jul-2012   9153       2
31        #openingceremony             27-Jul-2012   678721     2
[entries marked "?" are illegible in this copy of the scan; the cluster IDs for topics
2-3 and 10-11 were recovered out of column order and may be transposed]
32
#blur
02-Jul-2012
39524
33
#RoyalBaby
03-Dec-2012
140523
34
#KONY2012
07-Mar-2012
2305934
35
#StopKony
07-Mar-2012
2094112
36
#TwitternWie1989
09-Nov-2012
3358
37
#JournalistBerlin
09-Nov-2012
286
38
#Houla
27-May-2012
30123
39
#Damascus
18-Jul-2012
316768
40
#Syria
18-Jul-2012
7625671
41
#AskPele
27-Jun-2012
1379
42
#WorldCup
11-Sep-2012
63736
43
#PrayForMuamba
17-Mar-2012
441350
44
#CFC
28-Oct-2012
2846990
45
#VMARedCarpet
06-Sep-2012
67648
46
#nfltotalaccess
05-Feb-2012
6917
47
#summerwars
20-Jul-2012
97225
48
#ntv
20-Jul-2012
705685
49
Obama
07-Nov-2012
51927099
50
Gulf Oil Spill
30-Apr-2010
450756
51
Haiti Earthquake
13-Jan-2010
429935
52
Pakistan Floods
27-Aug-2010
44576
53
Koreas Conflict
23-Nov-2010
394
54
Chilean Miners Rescue
13-Oct-2010
38100
55
Chavez Tas Ponchao
25-Jan-2010
9222
56
Wikileaks Cablegate
10-Dec-2010
14081
57
Hurricane Earl
31-Aug-2010
149641
58
Prince Williams Engagement
16-Nov-2010
4290
59
World Aids Day
01-Dec-2010
153741
60
Apple iPad
27-Jan-2010
852505
61
Google Android
20-May-2010
366282
62
Apple iOS
22-Nov-2010
227047
63
Apple iPhone
07-Jun-2010
1645107
64
Call of Duty Black Ops
09-Nov-2010
430875
65
New Twitter
28-Sep-2010
2485215
96
HTC
15-Sep-2010
1250908
RockMelt
08-Nov-2010
120515
MacBook Air
20-Oct-2010
601737
Google Instant
08-Sep-2010
242225
#rememberwhen
22-Nov-2010
346771
#slapyourself
17-Nov-2010
274294
#confessiontime
07-Nov-2010
312491
#thingsimiss
05-Dec-2010
293510
#ohjustlikeme
20-Nov-2010
49351
#wheniwaslittle
10-Aug-2010
739665
#haveuever
17-Nov-2010
111262
#icantlivewithout
04-Nov-2010
157055
#thankful
25-Nov-2010
340865
#2010disappointments
02-Dec-2010
195233
toyota and recall
29-Jan-2010
9173
GM and switch
30-Mar-2014
2007
Infantino and baby sling
24-Mar-2010
46
#myNYPD
22-Apr-2014
142702
Boston marathon
15-Apr-2013
2470570
SONY and Korea
24-Dec-2014
38577
burger king and lettuce
19-Jul-2012
391
horsemeat
15-Jan-2013
339336
#McDStories
18-Jan-2012
25178
iwatch
09-Sep-2014
872266
Amazon Fire Phone
18-Jun-2014
283652
XBox One
02-Dec-2014
4877345
Kraft belVita
11-Nov-2011
189
Ebola
02-Oct-2014
37062214
Senator proposal
08-Mar-2013
2598
Japan Secrecy Bill Law
06-Dec-2013
480
Virginia AG recount
26-Nov-2013
1313
New Pontifex
13-Mar-2013
1291
IPL tournament
21-May-2013
6246
Castle in the Sky
25-Aug-2013
13460
97
100
Northern India flooding
21-Jun-2013
1688
101
Brazilian protests
22-Jun-2013
15396
102
Vine resume
21-Feb-2013
2869
103
Primetime Emmys
23-Sep-2013
8679
104
DOMA Prop8
26-Jun-2013
4843
105
Sochi Olympics
18-Dec-2013
686932
106
Salvage Costa Concordia
16-Sep-2013
20988
107
Italy election
26-Feb-2013
43998
108
World Youth Day
28-Jul-2013
35591
109
#hochwasser
03-Jun-2013
154423
110
Australian election
07-Sep-2013
40858
111
Kobe Cuban
24-Feb-2013
11516
112
German election
22-Sep-2013
32456
113
OneDirection
26-Aug-2013
544327
114
#aufschrei
25-Jan-2013
108686
115
Fashion Week
15-Feb-2013
1097929
116
Asiana 214
06-Jul-2013
55804
117
anniversary
50th
Washington
118
#IranTalks
24-Nov-2013
47603
119
#NobelPeacePrize
11-Oct-2013
26398
120
France World Cup
19-Nov-2013
71267
121
Dilma
06-Sep-2013
3217901
122
typhoon Philippines
09-Nov-2013
679418
123
Red panda
24-Jun-2013
88255
124
#Troon
17-Apr-2013
169696
125
Australian Open
27-Jan-2013
367373
126
Tour de France
21-Jul-2013
687379
127
Academy Awards
25-Feb-2013
293091
128
#RockInRio
16-Sep-2013
460624
129
MTV VMAs
26-Aug-2013
254083
130
#Ashes
17-Dec-2013
1153050
131
#MalalaDay
12-Jul-2013
92522
132
#MariagePourTous
23-Apr-2013
898580
[
arch
28-Aug-2013
98
59646
133
March Madness
21-Mar-2013
1163560
134
Thatcher death
08-Apr-2013
157218
135
jason collins gay
29-Apr-2013
259639
136
#stanleycup
25-Jun-2013
546988
137
#Inauguration
21-Jan-2013
170746
138
#DoctorWho
23-Nov-2013
1828363
139
#ThankYouSachin
15-Nov-2013
1116119
140
#SFBatkid
15-Nov-2013
285846
141
#RIPMandela
05-Dec-2013
159670
142
World Cup Draw
06-Dec-2013
255758
143
#ThankYouSirAlex
08-May-2013
678244
144
#StandWithRand
07-Mar-2013
489851
145
#2013MAMA
22-Nov-2013
545670
146
#UCLFinal
25-May-2013
437898
147
#PLL
28-Aug-2013
2495315
148
#Sharknado
12-Jul-2013
547256
149
Government shutdown
01-Oct-2013
1705491
150
#StandWithWendy
26-Jun-2013
495244
151
#SB47
04-Feb-2013
568084
152
#NBAFinals
21-Jun-2013
1523961
153
Wimbledon
07-Jul-2013
2015368
154
Eurovision
18-May-2013
1693285
155
#RoyalBaby
22-Jul-2013
1592166
156
#NewYearsEve
01-Jan-2014
224751
157
#IPL
01-Jun-2014
374769
158
#Carnaval
01-Mar-2014
477170
159
#RIPPhilipSeymourHoffman
02-Feb-2014
50326
160
#SuperBowl
03-Feb-2014
2755775
161
#Oscars
03-Mar-2014
4251433
162
#UmbrellaRevolution
03-Oct-2014
277261
163
#iVoted
04-Nov-2014
28120
164
#Abdicates
05-Jun-2014
274
165
#lndiaVotes
05-Mar-2014
8172
166
#wt20
06-Apr-2014
695129
99
167
#Alia
06-Jan-2015
5528
168
#Wimbledon
06-Jul-2014
907759
169
#USOpen
06-Sep-2014
819176
170
#Sochi2014
07-Feb-2014
5069375
171
#BCSChampionship
07-Jan-2014
273487
172
#WorldCup
08-Jul-2014
20084808
173
#FrenchOpen
08-Jun-2014
115527
174
#Coachella
09-Jan-2014
76383
175
#TDF14
09-Jul-2014
18734
176
#BerlinWall
09-Nov-2014
77477
177
#NYFW
09-Sep-2014
653920
178
#BringBackOurGirls
10-May-2014
3868601
179
#MalalaYousafzai
10-Oct-2014
152659
180
#ThanksLD
10-Oct-2014
103118
181
#Spain2014
10-Sep-2014
953698
182
#RIPRobinWilliams
11-Aug-2014
2757478
183
#ComingHome
11-Jul-2014
29180
184
#CometLanding
12-Nov-2014
570908
185
#GoldenGlobes
13-Jan-2014
1618349
186
#GermanyWins
13-Jul-2014
1468
187
#Ferguson
14-Aug-2014
9921883
188
#TheVoiceAU
14-Jul-2014
221374
189
#StanleyCup
14-Jun-2014
490531
190
#NBAFinals
16-Jun-2014
1333518
191
#Formulal
16-Mar-2014
273901
192
#NBAAllStar
17-Feb-2014
375858
193
#MH17
17-Jul-2014
4044279
194
#OnlyOnTwitter
18-Feb-2015
2707
195
#lceBucket Challenge
19-Aug-2014
22373
196
#BRITAwards
19-Feb-2014
45761
197
#LoveTheatre
19-Nov-2014
46692
198
#lndyRef
19-Sep-2014
5440172
199
#NFLPlayoffs
20-Jan-2014
586196
200
#FirstTweet
20-Mar-2014
357941
100
201
#MarchMadness
21-Mar-2014
1574794
202
#Glasgow20l4
23-Jul-2014
811096
203
#ISS
23-Nov-2014
492621
204
#MuseumWeek
24-Mar-2014
203938
205
#MH370
24-Mar-2014
4837319
206
#Cannes2014
24-May-2014
720491
207
#MarsOrbiter
24-Sep-2014
16032
208
#VMAs
25-Aug-2014
3082229
209
#HeForShe
25-Sep-2014
812374
210
#PhotoshopRF
25-Sep-2014
11479
211
#Emmys
26-Aug-2014
863563
212
#AusOpen
26-Jan-2014
804617
213
#Eleig6es2014
26-Oct-2014
411975
214
#DerekJeter
26-Sep-2014
149596
215
#Grammys
27-Jan-2014
3454718
216
#RIPMaya Angelou
28-May-2014
2705
217
#PutOutYourBats
28-Nov-2014
147452
218
#ModilnAmerica
28-Sep-2014
93461
219
#SOTU
29-Jan-2014
1391601
220
#WorldSeries
30-Oct-2014
1137844
Appendix B  Predicted Bass Model Curves of Eight Sampled Topics with Only Partial Data Known to the Model

[Figure pages: for each of eight sampled topics (Event 1, 2Chainz; Event 2, #AintNobodyGotTimeForThat; Event 5, #coolstorybro; Event 20, #Swimming; Event 56, Wikileaks Cablegate; Event 157, #IPL; Event 213, #Eleições2014; Event 214, #DerekJeter), four panels plot mention count against day (0 to 200) when 90, 120, 150, and 180 days of data are known to the model. Each panel overlays the true series with predicted Bass model curves labeled "1hr", "1d", and "5d". The plot images themselves are not recoverable from this reproduction.]
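The curves in this appendix come from fitting the Bass diffusion model to a truncated prefix of a topic's mention series and extrapolating the rest of the horizon. A minimal sketch of that idea, using nonlinear least squares as in reference 24 (the synthetic series, parameter values, and 90-day cutoff below are illustrative assumptions, not the thesis's actual data or estimation settings):

```python
import numpy as np
from scipy.optimize import curve_fit

def bass_cumulative(t, m, p, q):
    """Cumulative adoptions N(t) under the Bass model.
    m: market potential, p: coefficient of innovation, q: coefficient of imitation."""
    e = np.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)

# Synthetic "true" topic trajectory (illustrative parameters only).
t_full = np.arange(1, 181, dtype=float)            # 180 days, as in the appendix panels
true_m, true_p, true_q = 1.0e6, 0.01, 0.15
rng = np.random.default_rng(0)
y_obs = bass_cumulative(t_full, true_m, true_p, true_q) \
        * (1 + 0.02 * rng.standard_normal(t_full.size))

# Fit (m, p, q) with only the first 90 days known, then extrapolate.
known = 90
popt, _ = curve_fit(
    bass_cumulative, t_full[:known], y_obs[:known],
    p0=[y_obs[known - 1] * 2, 0.01, 0.1],          # rough starting guesses
    bounds=([1e3, 1e-4, 1e-4], [1e10, 1.0, 1.0]))  # keep parameters positive
m_hat, p_hat, q_hat = popt
y_pred = bass_cumulative(t_full, *popt)            # full 180-day predicted curve
```

Comparing `y_pred` against the held-out days 91 to 180 is what the four panels per topic visualize for progressively longer known prefixes.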