Retweet Waves:
Contagion in a Social Network
Alistair Tucker∗
Complexity Science DTC, University of Warwick.
(Dated: November 10, 2010)
Twitter is 2010’s preeminent ‘microblogging’ service, the medium used by millions to broadcast
140-character ‘tweets’ to one another. A large subset of its activity is visible to all, providing a
great resource to researchers seeking to understand more about interactions within large groups.
For this investigation we hooked into Twitter’s infrastructure to record tweets as they were made,
obtaining six distinct data sets. Our analysis focuses on the ‘retweet’ wave, a phenomenon in which
the content of a tweet propagates through repetition by other users. We take a ‘top-down’ approach,
hoping to find within our statistics clues as to the description or descriptions most appropriate to
the system. The suggestion of power laws in wave size distribution indicates parallels with other
avalanching systems. We introduce an illustrative mean-field model that displays finite-size scaling,
a possible mechanism by which to explain variations in wave size PDF over time. Attempts to
induce the corresponding curve collapse for observed distributions are however inconclusive in the
absence of more data. We further make measurements of shape and temporal extent of retweet
waves, hoping to uncover probabilistic relations between the size of a wave and the distribution in
time of its component tweets.
I. INTRODUCTION
If Social Science has traditionally been regarded as a
field handicapped by a lack of hard data, the time has
come to revise that opinion. In the age of the Internet-based social network, there have come online new sources
bearing a wealth of information on human interaction [1].
But what they provide is a very different quantity
to what may be gleaned from the familiar surveys—
typically conducted via questionnaire on samples that are
likely to be small and, without care, unrepresentative. It
is worth considering anew what approaches to the data
might profitably be attempted.
Twitter is one of the new sources, a popular microblogging and social networking service. It has been known to
play a rôle in the propagation of serious news stories [2],
but a recent report found ‘pointless babble’ to be the
largest single category into which activity was deemed to
fall [3]. Celebrity gossip remains its forte.
At first blush then, it might seem a frivolous object
of study. There is the appeal that work on this ‘fun’
topic is capable of generating headlines in outlets with
far greater readership than Physical Review E. But despite its appetite for excitable headlines on Twitter-based
research, the wider public is still unlikely to compare its
value favourably with, say, that of cancer research [4].
In fact it is to be hoped that research into Twitter may
be rewarded with insight into the world beyond Twitter.
There is no class of interaction occurring within this system that is hidden, so we might hope for analyses less
speculative than those we must engage in for more 'black box'-like systems. But it is possible that what we learn may be reapplied to such systems, enhancing our understanding of social behaviours in general or, further, Complexity as a whole.

∗ My thanks to supervisor Duncan Robertson, also to Profs. Sandra Chapman and Robin Ball for their insightful suggestions.
It is an insight of that field that systems apparently
dissimilar on the small scale may have deep connections
on the large scale. (And since a cancer also qualifies as a
complex system, perhaps there really is the prospect of
some benefit to medical research!)
Less far off, we can see many reasons why it is important to understand how people interact and come to
consensus. International development and environmental
protection are examples of fields in which it is often found
that top-down policy dicta fail to have the desired effect.
There might be some awareness of a failure to engage local populations or to plan correctly for the dissemination
of knowledge, but it is generally unclear how to ameliorate the situation. Those charged with achieving such a
change of culture would be aided by a sound grasp of just
how change is seeded and established in social systems.
Financial market regulators suffer from related problems. Recent times in particular have seen them appear
impotent in the face of market forces they can only pretend to understand. What we can say is that these forces
arise from the interplay of the many parts of a complex
social system, even though lines of interaction may be
murky.
A term often used in accounts of the credit crunch of
2007-2008 is ‘contagion’, referring to the way in which
cashflow disruption, or even just perception of a decline
in creditworthiness, was seen to spread like an infection
from corporate entity to corporate entity. We shall see
that such epidemiological metaphors are also highly applicable to activity in the Twittersphere.
For some there is much of interest in Twitter for its
own sake. Going back a long way, news organisations
have sought dialogue with their audience, traditionally
through the letters page, more recently through the comment box. Twitter opens up a new frontier. But in their
interest we might also see a defensive posture. After
all Twitter is potentially another prong of the Internet’s
feared assault on traditional media.
And naturally marketing organisations are interested
in what looks like a new way of reaching a large
(and rapidly increasing) audience. The ‘viral’ campaign
is something of a Holy Grail to modern marketers—
advertisement so beautifully designed that it spreads of
its own accord. Wieden + Kennedy has enjoyed success this summer with its much lauded campaign for Old
Spice. There is even evidence that it actually increased
sales of the smelly stuff.
For any company or individual with a brand to build
or protect, this is serious business. Relevant questions
include: does good news travel faster than bad news?
and what are the qualities of a viral tweet?
Most high-minded of all, we may speculate that an
understanding of how the Twittersphere thinks may help
us to answer the deepest questions about the thought
processes of society itself. How do ideas spread? How do
they resonate or cohere or find beauty with what has gone
before? Is this a process without end? Is it meaningful
to talk about progress in thought?
Philosophers have traditionally been more interested
in the psychology of the individual when it comes to understanding how we perceive the world. But the lens
through which we look is arguably as coloured by the
collective experience. Philosophy itself may be regarded
as a set of ideas that filter down to influence everyone’s
perception.
We might describe the focus of this project as the
spread of ‘memes’ (a buzzword used to describe ideas that
infect society rather in the manner of viruses [5]). For as
long as humans have thought, they have wondered how
thoughts arise. The individual mind may remain mysterious, but we may now be able to make great progress in
understanding the collective mind.
Exploration of this new data source is also attractive
because, as new territory, it remains largely unmapped.
We aim to approach the subject in as ‘top-down’ and
as model-free a manner as possible. By beginning with
objective measurements and statistics, we hope finally to
come to a qualitative understanding of what descriptions
might fit the system.
II. THE DATA SOURCE
Twitter is a web- and mobile phone-based service that
permits users to broadcast tweets of up to 140 characters
to a band of self-selected followers.
The network of followers and followed forms a directed
graph. Statistics relating to the structure of the graph
have been presented elsewhere [6]. The relationship between numbers of followers and followed at a node is generally asymmetric.
The follower mechanism is probably the major channel
through which users find tweets that are of interest to
them. But it is not the only one—searches and ‘trending
topics’ also play a part.
Retweets and hashtags are features of Twitter helpful
to us because they allow us to proceed without attempting full textual analysis. Interestingly both innovations
came from the user community itself, and were only introduced as formally supported features at a later date.
Hashtags, being simply a word or combination of words
preceded by the # symbol, are used as a means of labelling individual tweets according to a theme. Hashtags
can potentially develop in popularity over a lengthy period of time. Constantly revitalised by new contributions,
like a mutating cold virus, there need be no upper limit
on the number of times a hashtag spreads and reinfects.
We might study quantities that have direct parallels to
those studied in models of infectious diseases, for example the critical threshold of infectiousness at which the
disease never dies out.
But we focus on retweets, where users simply copy
tweets they like to their own followers, including the code
RT and an attribution. Since a genuine retweet (one that
leaves its content unmodified) will be made once at most
by each user, there is an upper limit on the extent of
its spread (though not necessarily an upper limit to the
time over which it occurs).
The timescale over which retweet waves play out is
smaller than that of hashtags. For this reason they are
easier to work with, given our data sets of limited duration, because we may be more confident that we have
observed them in (close to) their entirety.
Although we cannot identify everyone with whom an
idea resonates, we can identify the subset with whom
it resonates to such an extent that they are moved to
retweet it. To model the process of retweeting on a fine
scale would be hard. Perhaps we might simplify it thus:
Each user checks Twitter at intervals assumed to be
random according to some time-dependent rate, a function that is individual to the user but that might sensibly
be linked to timezone. Each time they check, they will
be faced with a list of tweets made by those they follow. Unless they check quite religiously, there is a good
chance that there will have been made more than a page’s
worth of tweets since they last checked. It is likely that
those made longer ago will never be read; certainly the
chance of their being retweeted must decline with time.
But if something is retweeted again by another of those
followed, it will rise once more to the top. Thus the more
retweets, the more likely it is to be read by this user, who
may in turn be tempted to retweet it again.
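This verbal sketch can be made concrete. The following is a purely illustrative simulation of one user's check times as an inhomogeneous Poisson process via the standard thinning method; the rate function, and the function name `simulate_checks`, are our own assumptions, not fitted to data or part of any API.

```python
import random

def simulate_checks(rate_fn, t_end, rng=random):
    # Check times drawn from an inhomogeneous Poisson process with
    # time-dependent rate rate_fn(t), via thinning: propose events at a
    # constant upper-bound rate r_max, then accept each with probability
    # rate_fn(t) / r_max.
    r_max = max(rate_fn(t) for t in range(int(t_end) + 1))
    t, checks = 0.0, []
    while True:
        t += rng.expovariate(r_max)
        if t > t_end:
            return checks
        if rng.random() < rate_fn(t) / r_max:
            checks.append(t)
```

The bound r_max is taken over an integer grid, which is adequate only for smoothly varying rates; a diurnal rate linked to timezone, as suggested above, would satisfy this.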
Twitter also provides the service to mobile phones,
possibly a different dynamic again. Also recall that people may retweet tweets that they find through searches
or by following the trending topics of the moment.
So an accurate microscopic model looks difficult to construct. We have to hope that we will be able to find things
to measure that are not overly affected by the simplifications we shall be forced to make.
TABLE I. Sets of data in this study, the collection periods and 'track keywords' used to filter the stream.

LABOUR: 2010-05-06 15:03 to 2010-05-07 21:30 (98 875 tweets). Keywords: 'Labour'.

ELECTION: 2010-05-07 21:30 to 2010-05-11 20:53 (405 255 tweets). Keywords: 'Labour', 'Conservative', 'Conservatives', 'Tory', 'Tories', 'Liberal', 'Lib', 'Cameron', 'Clegg', 'Cleggy', 'Weggy'.

COALITION: 2010-05-12 16:41 to 2010-05-16 08:07 (101 512 tweets). Keywords: 'Con/Dem', 'Con-Lib', 'Lib-Con', 'Cleggeron', 'Cameron/Clegg', 'Conservatives', 'Lib', 'Liberal', 'LibDem', 'LibDems', 'Cameron', 'Clegg', 'Cleggy', 'Weggy'.

BUDGET: 2010-06-22 12:59 to 2010-06-28 22:46 (149 569 tweets). Keywords: 'Budget', 'Osborne', 'Deficit', 'Debt', 'Tax', 'Cuts', 'Pensions', 'Benefits', 'Employment', 'Unemployment', 'NHS', 'Defence', 'Banks', 'Green', 'Coalition', 'Aid', 'EU', 'Universities', 'Schools', 'Progressive'.

BB11: 2010-07-28 19:09 to 2010-08-03 23:24 (105 706 tweets). Keywords: '#BB11'.

EDFRINGE: 2010-08-09 19:48 to 2010-08-25 03:09 (40 000 tweets). Keywords: '#EdFest', '#EdFringe', 'Edinburgh Festival', 'Edinburgh Fringe', 'Ed Fringe'.
A. Data Acquisition
Twitter is primarily a web-based service. As such its
output is consumed principally by users browsing HTML
web pages. It would be arduous (if possible) to extract
the information we need directly from the HTML. So it is
fortunate that Twitter makes available APIs that we can
use. These are mostly aimed at third-party applications
and web sites.
There are three separate APIs, the REST API, the
Search API and the Streaming API. All have limits on
the amount of data one may draw out, although it is possible to apply for enhanced access (whitelisting). For this
investigation I chose to use the Streaming API, estimating that it would yield the greatest quantities.
It is necessary to filter the output using ‘track keywords’ in order to limit the amount coming in to a level
that Twitter will allow. Some care is required in tuning
these so as to collect as much as possible without triggering a ‘track limitation’ notice. Despite our attention,
it was indeed such a notice that brought an end to four
out of six of our collections.
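For concreteness, the OR-semantics of track filtering (a tweet passes if it contains any keyword) can be mimicked offline. A minimal sketch with deliberately crude whitespace tokenisation; `matches_track` is our own name, not part of any Twitter API.

```python
def matches_track(text, keywords):
    # A tweet passes the filter if it contains any track keyword,
    # case-insensitively; tokenisation here is deliberately crude.
    tokens = set(text.lower().split())
    return any(k.lower() in tokens for k in keywords)
```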
I built our client using open-source components: a Java library Twitter4J whose methods I call from the Java-friendly Lisp Clojure [7]. Please contact me for a copy of the code used.
Clojure’s read-eval-print-loop (REPL) was invaluable
in providing an interactive approach to data collection
and exploration. MATLAB was used for further processing and exploration and for the drawing of the various
charts contained herein.
Twitter is something of a moving target. It redesigned
its web interface during the course of these investigations
and also made changes to the Streaming API we were
using.
It recently donated its archive to the US Library of
Congress and also made its historical activity available
for search via Google. At the time of writing, there seems
still to be no way to take advantage of the Library’s acquisition. And although Google does provide APIs for
several of its search functions, unfortunately the Twitter
results seem only to be available via browser-rendered
HTML.
III. ANALYSIS AND RESULTS
Let us call a ‘wave’ the sequence (or avalanche) of
retweets that proceeds from a single tweet.
The initial representation of a wave to come from the
data is the set of times at which tweet and retweets occur.
We may bin the data on the time axis to recover a rate
function. Alternatively waiting time between tweets may
be regarded as a function of tweet number (sometimes the
more natural, and parsimonious, approach).
Either way, a wave can be said to have some characteristic size, shape and timescale.
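Both representations are simple to compute from raw timestamps; a sketch (function names are ours):

```python
def rate_function(times, bin_width):
    # Bin event times on the time axis and return events per unit time.
    t0 = min(times)
    n_bins = int((max(times) - t0) // bin_width) + 1
    counts = [0] * n_bins
    for t in times:
        counts[int((t - t0) // bin_width)] += 1
    return [c / bin_width for c in counts]

def waiting_times(times):
    # Waiting time between consecutive tweets, as a function of tweet number.
    ts = sorted(times)
    return [b - a for a, b in zip(ts, ts[1:])]
```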
To establish relationships in probability between these
three is our aim. We should like this to give us insight into the type of model appropriate to this domain.
Even leaving such hopes to one side, it would be an
achievement to learn something of the circumstances under which a tweet is likely to garner most exposure.
Our distributions need not be independent of time or
track keyword, but we ought still to be able to establish
relations.
A. Distribution of Wave Size
In physical models of sandpiles, ferromagnets and the
like, it is common to study avalanches. It is generally
assumed that there is a separation of timescales between
the one at which the system is driven and the one at
which the avalanches play out. Effectively avalanches
are supposed to occur instantaneously.
A wave of retweets is somewhat analogous to an
avalanche. The size of the wave we define as the number
of individuals who retweet, just as the size of an avalanche
is the number of sites that change state.
In practice retweet waves are not instantaneous. In any
set of measurements they will necessarily be truncated by
the moment at which the measurements were made. A
consequence is that we can never be sure that a wave has
been recorded in its entirety. (And it is not clear that
suitable models would admit that a wave ever comes to
a complete stop.)
Nevertheless we can and do approximate wave size by
the number of retweets we count within our sample. If
the resultant error is significant, we ought to be able to
observe its effect in the difference between the distribution of waves recorded at the beginning of the sample
and the distribution of waves recorded at the end.
Where S is the random variable that stands for the
size of some wave, let
f(s) = Pr(S = s)

F(s) = Pr(S ≥ s) = Σ_{r=s}^{∞} f(r)
It is instructive to plot on a log-log scale the estimate of
F that is the rank-ordered plot of recorded data (Figs. 1
to 6).
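The rank-ordered estimate of F is straightforward: sort the observed sizes in decreasing order, and the (i+1)-th largest value gets the estimate (i+1)/n. A sketch (the function name is ours):

```python
def empirical_ccdf(sizes):
    # Rank-ordered estimate of F(s) = Pr(S >= s): the (i+1)-th largest
    # observation gets the estimate (i+1)/n.
    n = len(sizes)
    return [(x, (i + 1) / n) for i, x in enumerate(sorted(sizes, reverse=True))]
```

Plotting these pairs on log-log axes gives plots of the kind shown in Figs. 1 to 6.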
There is in those figures the suggestion of a straight
line. That would imply a power law, a symmetry across
scales. Claims of power laws have been made with enthusiasm in many spheres, and frequently in social phenomena, as their existence can be evidence for certain
underlying mechanisms [8].
However it is difficult to demonstrate empirically the
existence of a power law. A methodical reworking of
previous claims concludes that many are unfounded [9].
With the limited number of data that we currently have,
we would need to appeal to a priori reasons to conclude
a power law. In a later section, III A 2, we attempt to
establish whether such reasons might exist.
1. Time Dependence
Each one of Figs. 1 to 6 depicts the distribution derived from the set of tweets made over the whole of the
(respective) collection period. The question that these
plots address is, “What is the probability that a wave
picked at random from the collection period has size s?”
As such, each of those distributions may be regarded as
a weighted average of distributions for each of a number
of equal-size time sections that partition the collection
period. (Alternatively it might be regarded as an unweighted average of distributions relating to each of a
number of equal-size measure sections that partition the
total collection of tweets.)
It is to be assumed that there exists a size for time
(or measure) sections sufficiently small that within it the
process under investigation may be regarded as stationary in important senses.
System size, for example, is a quantity that varies
greatly throughout the day as people log into and out
of Twitter. But for a sufficiently small period of time we
can surely regard it as constant.
If we are to partition the data sets in time, we face
again the practical issue that waves do not occur as instantaneously as we would like to pretend. It is not entirely clear how to assign waves to time sections when
many sprawl across section boundaries.
In a world of instantaneous waves it must be the case
that rτ (s), the rate at which waves of size s occur during time section τ (assumed constant over that section),
obeys
s r_τ(s) = (1/|τ|) n_τ(s)    ∀ s ∈ ℕ
where nτ (s) is the number of tweets counted within the
interval τ that belong to a wave of size s.
By defining rτ (s) in terms of nτ (s) according to this
equation, we maintain continuity as we move to the
real-world situation of temporally extended waves. It
amounts to the assignment of fractions of a wave to different sections.
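In code, the fractional assignment amounts to a single division per size; here n_tau maps a wave size s to the number of tweets in the section belonging to waves of that size (names are ours):

```python
def wave_rate(n_tau, section_length):
    # r_tau(s) = n_tau(s) / (s * |tau|): each tweet contributes 1/s of a
    # wave of size s to the section in which it falls.
    return {s: n / (s * section_length) for s, n in n_tau.items()}
```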
A simpler approach might be to assign a wave in its entirety to the section in which its first tweet appears. But for waves of longer duration in particular, this has not always seemed appropriate.
Figs. 7 through 12 show total rate functions for waves
and tweets respectively,
R_W = Σ_s r(s),    R_T = Σ_s s r(s)
where r(s) has been estimated for each time section as
described above. As one might expect, these measures
vary substantially over time, most obviously on a diurnal
basis.
FIG. 1. LABOUR: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).

FIG. 2. ELECTION: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).

FIG. 3. COALITION: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).

FIG. 4. BUDGET: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).

FIG. 5. BB11: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).

FIG. 6. EDFRINGE: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).
FIG. 7. LABOUR: R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.

FIG. 8. ELECTION: R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.

FIG. 9. COALITION: R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.

FIG. 10. BUDGET: R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.

FIG. 11. BB11: R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.

FIG. 12. EDFRINGE: R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.
FIG. 13. ELECTION: likelihood of exponent α over time (80% confidence).

FIG. 14. COALITION: likelihood of exponent α over time (80% confidence).

FIG. 15. BUDGET: likelihood of exponent α over time (80% confidence).

FIG. 16. BB11: likelihood of exponent α over time (80% confidence).

FIG. 17. EDFRINGE: likelihood of exponent α over time (80% confidence).
Although the rates are highly time-dependent, that
does not imply that the normalised PDF,
f_τ(s) = r_τ(s) / Σ_{s′} r_τ(s′)
need also vary with time. However constant fτ (s) would
imply parallel lines (to within statistical error) for RT
and RW in Figs. 7 to 12. At a glance, that is not obviously the case.
We may examine more closely the question of time
dependence of fτ (s) using a statistic that seems natural
given the semblance of a power law, namely the likely
exponent ατ in the assumed relation,
f_τ(s) = (1/ζ(α_τ)) s^{−α_τ}
which gives us that for data set Dτ associated with time
section τ ,
Pr(D_τ|α) = (1/ζ(α)^{|D_τ|}) ∏_{j=1}^{|D_τ|} e^{−α log s_j}
We are told that the exponent of a power law is best
estimated using the Maximum Likelihood (ML) procedure [9]. It is simple to perform this task numerically,
basing results on the log-likelihood function,
L_τ(α) = log Pr(D_τ|α) = −α Σ_{s=1}^{∞} (1/s) n_τ(s) log s − log ζ(α) Σ_{s=1}^{∞} (1/s) n_τ(s)
This function need only be maximised to recover the ML
estimate. The red crosses in Figs. 13 through 17 correspond to the modes of Lτ (α).
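A numerical sketch of the ML fit, assuming the zeta-normalised power law above. We truncate ζ(α) at a large cutoff and maximise over a grid, which is adequate for illustration; all names are ours, and the grid and cutoff are arbitrary choices.

```python
import math

def zeta(alpha, n_terms=10_000):
    # Truncated Riemann zeta; the neglected tail is negligible for the
    # exponents (alpha well above 1) considered here.
    return sum(k ** -alpha for k in range(1, n_terms + 1))

def log_likelihood(alpha, n_tau):
    # n_tau[s] = tweets belonging to waves of size s, hence n_tau[s]/s
    # waves of size s, as in the log-likelihood above.
    n_waves = sum(k / s for s, k in n_tau.items())
    sum_log = sum((k / s) * math.log(s) for s, k in n_tau.items())
    return -alpha * sum_log - n_waves * math.log(zeta(alpha))

def ml_exponent(n_tau, alphas=None):
    # Maximise the log-likelihood over a grid of candidate exponents.
    alphas = alphas or [1.5 + 0.01 * i for i in range(201)]
    return max(alphas, key=lambda a: log_likelihood(a, n_tau))
```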
But by normalising, we can use the log-likelihood function to construct a distribution over α,

p(α|D_τ) = e^{L_τ(α)} / Σ_{α′} e^{L_τ(α′)}

In general it will not be possible to perform this calculation directly on the computer (at least in MATLAB) because of machine limitations to floating point numbers. However the issue may be sidestepped by transforming our equation thus,

p(α|D_τ) = e^{L_τ(α) − m_τ} / Σ_{α′} e^{L_τ(α′) − m_τ}

where

m_τ = max_α L_τ(α).

From the normalised distribution p(α|D_τ) we easily derive median and error bars, as depicted in blue in Figs. 13 through 17.

It does seem clear that the variations exceed what one can expect through error alone, and therefore that f_τ(s) may not be considered constant in time. But we might hope still to be able to model it. One approach is to regard the variation in our estimator as the consequence of finite-size effects that distort the shape of f_τ(s) (now strictly a power law only in the infinite-size limit) as the system size changes over time.

There is the hint of a 'roll-off' in the tails of our distributions (Figs. 1 to 6). This is a characteristic of finite-size scaling, illustrated by the simple mean-field model that we introduce next.

2. Illustrative Model of Meme Spread

Inspired by the parallels between retweet waves and mean-field models of avalanches that manifest as Barkhausen noise in ferromagnets [10], we introduce a simple theoretical model with which we can illustrate possible approaches to real data.

At best it can be said to approximate only one small aspect of the process. So although we begin by specifying an 'external' field H, we are not requiring that it be external to the system, merely noting that its source is external to our model.

H refers to a pressure exerted on the community in a particular direction of 'thought-space'. Its effect is felt at the point where it reaches a level such that some individual, the one most predisposed to do so, is moved to articulate it.

That event serves to strengthen the field as exerted on the rest of the community. Depending on their predisposition to do so, measured by a random field f_i associated with each, others may also respond by expressing the same meme. This in turn will increase the field yet further.

FIG. 18. The PDF of avalanches in the illustrative model for systems of three different sizes L. It is clear that in the infinite-size limit we would see a strict power law with exponent α = 3/2.
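Returning to the estimation procedure: the shifted normalisation p(α|D_τ) ∝ e^{L_τ(α) − m_τ} is the familiar guard against floating-point underflow and works in any language. A sketch over a discrete grid of α values (the function name is ours):

```python
import math

def normalise_loglik(loglik):
    # loglik: {alpha: L_tau(alpha)}. Subtract the maximum before
    # exponentiating so that at least one term is e^0 = 1 and nothing
    # underflows to zero; the shift cancels in the normalisation.
    m = max(loglik.values())
    unnorm = {a: math.exp(L - m) for a, L in loglik.items()}
    z = sum(unnorm.values())
    return {a: u / z for a, u in unnorm.items()}
```

With raw values around −1000, the direct exponential would underflow to 0/0, while the shifted version is exact up to rounding.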
Specifically (and with only limited justification) we imagine a subpopulation of L susceptible individuals, having their predisposition f_i distributed uniformly across an interval of length L,

f_i ∼ U(H, H + L)

and stipulate that the meme is repeated by any individual whose predisposition satisfies

f_i ≤ H + Σ_j s_j

with s_i denoting the state of an individual i, a switch from 0 to 1 if and when that individual responds.

The chain of repetition may be regarded as an avalanche. The distribution of avalanche size is plotted in Fig. 18.

The distributions are self-similar and we recognise finite-size scaling [11], expressed thus in terms of L,

D(S, L) = S^{−3/2} D(S L^{−1})

Therefore we may plot S^{3/2} D against S L^{−1} to see the curve collapse (Fig. 19).
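The dynamics specified above are easy to simulate directly, which is how distributions like those in Figs. 18 to 20 can be generated. A minimal sketch of one avalanche, taking H = 0 without loss of generality since the rule depends only on f_i − H:

```python
import random

def avalanche_size(L, rng=random):
    # Predispositions f_i uniform on (H, H + L), with H = 0; sort so that
    # individuals respond in order of increasing predisposition.
    f = sorted(rng.uniform(0, L) for _ in range(L))
    # The field is driven up to f[0], triggering the first response; each
    # response then raises the effective field by one unit (s_j: 0 -> 1),
    # and the avalanche stops at the first individual left unmoved.
    size = 0
    for fi in f:
        if fi <= f[0] + size:
            size += 1
        else:
            break
    return size
```

Histogramming many runs for several values of L should reproduce the s^{−3/2} power law with a finite-size cutoff.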
FIG. 19. The PDFs of systems of different size are self-similar, as can be seen from this curve collapse.

Fig. 20 shows a slightly different situation. Each of the lines relates to a distribution that is effectively an average over a period during which the system size randomly varies between its maximum and a tenth of that value. This has the effect of smoothing the 'roll-off' visible in the CDF, somewhat reminiscent of what we see in the real Twitter data.

FIG. 20. The CDF of avalanches in the illustrative model for systems of different maximum size. Each distribution averages over a time period during which the system size varies between its maximum value and a tenth of that.

3. Finite Size Scaling in the Data

The ELECTION data set, the largest and the busiest we have, is the natural choice to work with. Our hypothesis is that as the size of the system varies over the period, we shall see finite-size scaling in the distribution of wave size. In that case we should be able to perform a curve collapse with real data just as we did in Fig. 19 with simulated data.
In order to get there, we shall need to divide the period
into sections sufficiently small that we can assume system
parameters such as size to be constant (or substantially
reduced in variability).
Unfortunately we have no good measure for system
size, especially as time sections become short. It is impossible to know who is logged in to Twitter unless they
engage in tweeting.
So we just count the number of unique users tweeting
during a time section τ and call that the system size L.
This quantity is perhaps not a particularly valuable one;
the number of tweets itself might have served just as well
as a measure of system size.
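Concretely, given (timestamp, user) pairs, the per-section system size is just a unique-user count; a sketch (names are ours):

```python
def system_sizes(tweets, n_sections):
    # tweets: list of (time, user). Partition the collection period into
    # equal time sections and count unique users tweeting in each.
    times = [t for t, _ in tweets]
    t0, t1 = min(times), max(times)
    width = (t1 - t0) / n_sections or 1.0  # guard against a zero-length span
    users = [set() for _ in range(n_sections)]
    for t, u in tweets:
        i = min(int((t - t0) / width), n_sections - 1)
        users[i].add(u)
    return [len(s) for s in users]
```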
Fig. 21 shows the situation where the period has been
split into ten equal-size time sections. Fig. 22 shows how
data collapse might be attempted.
Both plots are messy and jagged even with as few as
ten sections. It would be unduly optimistic to hope to
perform a meaningful curve collapse.
But our aim in partitioning the time period is to end up
with time sections in which system parameters (such as
system size) vary to only a fraction of the degree to which
they vary over the whole period. Fig. 13 suggests that at
least fifty time sections would be required to achieve this.
If it is hard to imagine success with only ten sections,
then it must be virtually impossible for fifty.
Smoothing the CDFs is one approach that might make
them easier to work with. Instead we turn to the PDFs
and, with a similar intention, bin data points. In some
ways, the PDF is a more suitable quantity to deal with. It
is perhaps more intuitive, and is also free of the spurious
correlations that one sees in CDF plots.

Fig. 23 suggests that despite the binning, it will still be hard to make much of the data, even with as few as ten sections.

Our final attempt, shown in Fig. 24, is more promising. Here we have partitioned the time period into many (200) sections, so that it may be assumed that system parameters within a section are almost constant. But we have then aggregated the sections according to each one's value of L (for which 'system size' becomes an increasingly inaccurate description as the sections become small).

FIG. 21. The CDFs of ELECTION wave sizes under a time partition of ten equal-size sections.

FIG. 22. The same collection of CDFs scaled in such a way as might have been hoped to induce curve collapse.

FIG. 23. The PDFs of ELECTION wave sizes under a time partition of ten equal-size sections.

FIG. 24. The PDFs of ELECTION wave sizes under a time partition of 200 sections, subsequently aggregated according to L.

B. Individual Waves in Time
In the previous section III A we clung to the convenient fiction that retweet waves occur instantaneously.
For some purposes, where we have clear separation of
timescales, it may be reasonable to make this simplification.
But there does exist some timescale (however short) over which a wave develops and decays. Behaviour at this scale may yield further clues as to the nature of the process we investigate.
Naturally the statistics for individual waves are subject
to a level of noise. By basing our analysis on the whole
ensemble of wave observations, we hope to mitigate the
effects of randomness.
But it can be difficult to know how to ‘average’ over
waves. To do so we require some minimal model relating
one wave to another. We make a start in parametrising
that model in the next part III B 1.
Examination by eye suggests that a large wave might
reasonably be viewed as a superposition of a number of
smaller waves, each triggered by a retweet made by one
of a relatively small number of influential individuals.
1. Relationship between Wave Size and Timescale
It might be supposed that, on average, the time envelope associated with a large wave (one with more participants) will have greater extent than one associated with
a small wave. That would be the natural consequence of
the time taken for the ‘contagion’ to spread from node to
node.
In order to test this hypothesis, we need to take a
measurement of each wave that encapsulates its inherent
timescale. We should like to average these measurements
over waves of the same or similar sizes, then to compare
those averages between the sizes.
But many measurements will depend on wave size in a trivial way, and will therefore suggest a spurious relation.
To pin this issue down, we posit that for each wave
there exists an underlying distribution in time (its envelope) according to which its tweets are distributed. Then
the parametrisation of one such distribution can be compared with that of another, without the precise number
of tweets manifested having direct effect.
Of course it is not possible directly to observe the parameters of these underlying distributions since we have
only a sample from each, the observed tweets. The best
that we can hope for is a measurement that constitutes an
unbiased estimator of an underlying parameter; that is,
one whose expectation is equal to the underlying value.
The Law of Total Expectation tells us that
    E[V̂] = E_V[ E[V̂ | V] ]
where V̂ is our estimator for parameter V of the underlying distribution. An unbiased estimator will further
obey
    E[V̂ | V] = V
so that
    E[V̂] = E_V[V] = µ_V
We have such an estimator in
    V̂ = (1/(n − 1)) Σ_{j=1}^{n} (x_j − x̄)²
The expectation of V̂ is the variance V (in units of time squared) of the underlying distribution. Importantly, that statement is independent of any assumption about the distribution's shape.
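A quick numerical illustration of this unbiasedness, with NumPy's ddof=1 playing the role of the n − 1 denominator (the Gaussian envelope and parameter values here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0  # variance of an (illustrative) underlying envelope

# Many small samples, as if each were the observed tweets of one wave;
# ddof=1 applies the n-1 denominator of the estimator above.
estimates = [np.var(rng.normal(0.0, 2.0, size=10), ddof=1)
             for _ in range(20000)]

# The mean of the estimates approaches the underlying variance,
# whatever the shape of the envelope distribution.
print(np.mean(estimates))  # close to 4.0
```

Note that while each individual estimate from ten points is very noisy, the ensemble mean converges on the true value, which is all the argument below requires.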
Then for a collection {v̂k | k = 1, . . . , m} of such estimates, the Central Limit Theorem applies for large m,
    ( Σ_{k=1}^{m} v̂_k − m µ_V ) / ( σ_V̂ √m ) ∼ N(0, 1) .
where µV is the mean variance of the underlying distributions, a measure of the average wave timescale. Thus we are led to the following expression for the likelihood of that mean,
    µ_V ∼ N( (1/m) Σ_{k=1}^{m} v̂_k , σ_V̂²/m )
So by replacing σ_V̂² with the estimator σ̂_V̂², calculated from the sample, we may put error bars on our estimates of µV. We take this approach with the Twitter data, binned according to wave size, and are rewarded with Figs. 25 to 29.
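The binning-and-error-bar procedure might be sketched as follows; the function name, the logarithmic size bins and the synthetic inputs are assumptions for illustration, not the exact analysis pipeline.

```python
import numpy as np

def mean_variance_by_size(wave_sizes, wave_variances, n_bins=10):
    """For waves binned by size (logarithmic bins), estimate the mean
    underlying variance mu_V with a CLT-based 80% interval, in the
    spirit of Figs. 25-29."""
    s = np.asarray(wave_sizes, dtype=float)
    v = np.asarray(wave_variances, dtype=float)
    edges = np.logspace(np.log10(s.min()), np.log10(s.max() + 1), n_bins + 1)
    z80 = 1.2816  # two-sided 80% normal quantile
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        vk = v[(s >= lo) & (s < hi)]  # per-wave variance estimates in bin
        m = len(vk)
        if m < 2:
            continue  # cannot estimate sigma_Vhat from fewer than two waves
        mu_hat = vk.mean()
        # sigma_Vhat estimated from the sample of estimates themselves
        half_width = z80 * vk.std(ddof=1) / np.sqrt(m)
        results.append((lo, hi, m, mu_hat, half_width))
    return results
```

The m < 2 guard makes explicit the caveat discussed below: bins with few waves give unreliable values of σ̂_V̂, and hence unreliable error bars.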
It is important to note that a small value of m undermines both the validity of our use of the Central Limit Theorem and the accuracy of our estimator σ̂_V̂². Some of
the larger error bars in the figures indicate exactly that,
and should therefore be taken with a pinch of salt. The
whole of the EDFRINGE plot, for example, is dubious.
However most of the other sets of results look reasonable. Remember that with 80% confidence intervals, we
expect the true value of µV to lie outside the error bars
on about one in five occasions.
The dependence of µV on wave size is not at all dramatic, suggesting that the effect of delay as the contagion spreads from node to node is not as important as one might have thought.
We also have substantial variation in the magnitude
of µV according to the data set. Of course they are not
really comparable, being taken from different populations
at different times. Again a partitioning of the period into
smaller time sections might permit comparisons.
FIG. 25. ELECTION: dependence of wave timescale on wave size (80% likelihood).
FIG. 26. COALITION: dependence of wave timescale on wave size (80% likelihood).
FIG. 27. BUDGET: dependence of wave timescale on wave size (80% likelihood).
FIG. 28. BB11: dependence of wave timescale on wave size (80% likelihood).
FIG. 29. EDFRINGE: dependence of wave timescale on wave size (80% likelihood).
[Axes in each: wave size s against mean µV of wave variance (days²).]
2. Individual Wave Development and Decay
In the absence of a solid framework by which to perform averages over many waves, we ask what we can learn
from individual examples.
Figs. 30 and 31 depict the development and decay of
the two largest waves of the ELECTION data set.
@Jason at DAVID: Nick Clegg has changed his Facebook
relationship status to: “it’s complicated.”
@stephenfry: LibDem/Tory Rule! All will now be well.
Justice prosperity kindness happiness for all! Yay! What
could be better? #heavyhandedirony
In these plots we have inverted the usual relationship
of measure µ and time t. It is then easy to produce error
bars for the mean waiting time τ (in a similar procedure
to that employed in section III B 1).
The simplicity of the plots, virtually a straight line in
long sections (those free of influential retweets), suggests
a law for the decay of waves.
Suppose we have a period over which waiting time τ
varies as
FIG. 30. Decay of the largest wave of ELECTION (90% likelihood): log waiting time against retweet number.
log τ = α + βµ
Then
    t = e^α (1 + e^β + e^{2β} + ··· + e^{(µ−1)β})
    e^β t = e^α (e^β + e^{2β} + ··· + e^{(µ−1)β} + e^{µβ})
so
    (e^β − 1) t = e^α (e^{µβ} − 1)
We have
    µ = (1/β) log(1 + e^{−α} (e^β − 1) t)
and
    dµ/dt = e^{−α} (e^β − 1) / (β (1 + e^{−α} (e^β − 1) t)) = 1 / (β (e^α/(e^β − 1) + t))
FIG. 31. Decay of the second-largest wave of ELECTION (90% likelihood): log waiting time against retweet number.
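The closed forms above are easy to check numerically. This sketch (with illustrative values of α and β) confirms that µ(t) inverts the cumulative waiting-time sum and that the two expressions for dµ/dt agree:

```python
import numpy as np

alpha, beta = 1.0, 0.005  # illustrative decay parameters

def t_of_mu(mu):
    """Cumulative time of the mu-th retweet: e^a (e^{mu*b} - 1) / (e^b - 1)."""
    return np.exp(alpha) * (np.exp(mu * beta) - 1.0) / (np.exp(beta) - 1.0)

def mu_of_t(t):
    """Inverse relation: mu = (1/b) log(1 + e^{-a} (e^b - 1) t)."""
    return np.log1p(np.exp(-alpha) * (np.exp(beta) - 1.0) * t) / beta

def dmu_dt(t):
    """Retweet rate: 1 / (b (e^a / (e^b - 1) + t)), the second form above."""
    return 1.0 / (beta * (np.exp(alpha) / (np.exp(beta) - 1.0) + t))

t = t_of_mu(500.0)
print(mu_of_t(t))  # recovers 500.0
```

The second form of dµ/dt makes the qualitative behaviour plain: at late times the retweet rate falls off as 1/(βt), a slow hyperbolic decay rather than an exponential one.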
The largest single wave contained within the ELECTION data set is made up of 584 retweets, each saying,
RT @Jason at DAVID: Nick Clegg has changed his Facebook relationship status to: “it’s complicated.”
But another 970 tweets contain the character strings
‘Nick Clegg’, ‘Facebook’, ‘relationship status’ and ‘complicated’. All express the same joke.
Many of those 970 do attribute Jason at DAVID, or
retain exactly the same phrasing, suggesting that they
do derive from the same source, but somehow became
separated from the main wave. (There are other ways of retweeting that the system will not recognise, copy-pasting being the simplest.)
However it does look as though some of them may have
originated independently. Of course it is hard to be sure.
But I am prepared to believe that they came about independently, given that the joke was, perhaps, a relatively obvious one to make at that time. It is sometimes the case in the sciences that different researchers come up with the same ideas independently, in parallel. (Not so much the great and unexpected ideas perhaps, but those that are a natural step from what has gone before.)
Perhaps there was, in some sense, a certain inevitability to this tweet given the time and the situation. I like
to think that this lends credibility to the idea of a ‘pressure’ in a certain direction in ‘thought-space’ (as in our
toy model) that somebody somewhere is bound to articulate.
IV. FURTHER WORK
Where might a model such as that lead us? For every one of the large (but finite) number of tweets that it is possible to make, each individual has an associated field f_i, representing their disposition to tweet or retweet it. Thus we locate each individual at a point in a space whose axes are the possible tweets.
There are bound to be strong correlations between an individual's coordinates on the various axes. With enough data we might even hope to apply an algorithm such as PCA to reduce the effective dimensionality to a manageable number. We would then have uncovered a workable definition of 'thought-space'!
It may not be assumed that an individual will stay still at any one location in this space. We might then ask how its movement relates to activity within the Twittersphere. But it is naïve to treat Twitter as a system in isolation; perhaps it is better viewed as a visible subset of the scaled-up system that is society and society's thought.
In our data sets the information already exists that
would permit us to identify individuals within waves, and
to compare and contrast their activity across waves. To
make use of that is likely to be an important step forward.
It is also possible to download the follower graph itself,
and to use this would vastly improve our knowledge of
the paths taken by waves through the network. In this
investigation we have made no consideration of the impact of network structure, although this might have been
thought to be one of the more obvious lines of enquiry.
But we have seen that our evidence for a timescale increasing with wave size verges on the insignificant. And other work has shown that in general the paths are short [6].
We should try to measure network effects and establish
somehow their importance. The simplest model is going
to remain the fully connected network, and it is not clear
that it is network effects that are the biggest threat to
this model’s credibility.
Another obvious path might be textual analysis, although this is potentially a complicated one to take. It
is always possible though that a model of natural language will benefit as much from its link with a model of
Twitter as vice versa.
We suffered in this project from a lack of data. I had
believed that I had collected quite a lot from the Twitter
feed, but ultimately I found myself attempting analyses
that wanted more. (Perhaps this is always the way!)
Clever use of averaging and aggregating (Fig. 24) is
important to make best use of the quantity we have, be
it averaging over different waves at the same point in
their decay, or averaging over times at points when the
system has roughly the same size (e.g. summing over
three o’clocks in the morning).
But more data would also be very helpful. An immediate target for future work might be to establish a line
for more data, be that from Google, from the Library of
Congress or from Twitter itself.
Associated work by Sandra Chapman, Duncan Robertson and Ed Bullmore has investigated shocks in neurological and financial systems. There remains a great deal
of scope for exploring parallels between those types of
system and online social networks. Those investigations
have apparently moved in an information-theoretical direction, and it would be very interesting to see whether
such an approach might be applicable to the study of
Twitter.
[1] “Social networks: The great tipping point test,” New
Scientist (Jul 26, 2010).
[2] “New York plane crash: Twitter breaks the news, again,”
Daily Telegraph (Jan 16, 2009).
[3] “Twitter study,” Pear Analytics (Aug 12, 2009).
[4] “The mathematical formula for how celebrity gossip
spreads on the internet,” Daily Mail (Mar 31, 2010).
[5] “Internet memes,” BBC Focus (Aug 26, 2010).
[6] H. Kwak, C. Lee, H. Park, and S. Moon, in WWW
’10: Proceedings of the 19th international conference on
World wide web (ACM, New York, NY, USA, 2010) pp.
591–600, ISBN 978-1-60558-799-8.
[7] Thanks to Yusuke Yamamoto for Twitter4J and to Rich
Hickey et al. for Clojure.
[8] D. Sornette, Critical Phenomena in Natural Sciences: Chaos, Fractals, Selforganization and Disorder:
Concepts and Tools (Springer Series in Synergetics)
(Springer, 2006) ISBN 3540308822.
[9] A. Clauset, C. R. Shalizi, and M. E. J. Newman, SIAM
Review, 51, 661 (2009).
[10] J. P. Sethna, K. Dahmen, S. Kartha, J. A. Krumhansl,
B. W. Roberts, and J. D. Shore, Phys. Rev. Lett., 70,
3347 (1993).
[11] J. P. Sethna, Statistical Mechanics: Entropy, Order
Parameters and Complexity (Oxford Master Series in
Physics), illustrated edition ed. (Oxford University Press,
USA, 2006) ISBN 0198566778.