Retweet Waves: Contagion in a Social Network

Alistair Tucker∗
Complexity Science DTC, University of Warwick.
(Dated: November 10, 2010)

Twitter is 2010’s preeminent ‘microblogging’ service, the medium used by millions to broadcast 140-character ‘tweets’ to one another. A large subset of its activity is visible to all, providing a great resource to researchers seeking to understand more about interactions within large groups. For this investigation we hooked into Twitter’s infrastructure to record tweets as they were made, obtaining six distinct data sets. Our analysis focuses on the ‘retweet’ wave, a phenomenon in which the content of a tweet propagates through repetition by other users. We take a ‘top-down’ approach, hoping to find within our statistics clues as to the description or descriptions most appropriate to the system. The suggestion of power laws in wave size distribution indicates parallels with other avalanching systems. We introduce an illustrative mean-field model that displays finite-size scaling, a possible mechanism by which to explain variations in wave size PDF over time. Attempts to induce the corresponding curve collapse for observed distributions are however inconclusive in the absence of more data. We further make measurements of shape and temporal extent of retweet waves, hoping to uncover probabilistic relations between the size of a wave and the distribution in time of its component tweets.

I. INTRODUCTION

If Social Science has traditionally been regarded as a field handicapped by a lack of hard data, the time has come to revise that opinion. In the age of the Internet-based social network, new sources have come online bearing a wealth of information on human interaction [1]. But what they provide is a very different quantity from what may be gleaned from the familiar surveys— typically conducted via questionnaire on samples that are likely to be small and, without care, unrepresentative.
It is worth considering anew what approaches to the data might profitably be attempted. Twitter is one of the new sources, a popular microblogging and social networking service. It has been known to play a rôle in the propagation of serious news stories [2], but a recent report found ‘pointless babble’ to be the largest single category into which activity was deemed to fall [3]. Celebrity gossip remains its forte. At first blush then, it might seem a frivolous object of study. There is the appeal that work on this ‘fun’ topic is capable of generating headlines in outlets with far greater readership than Physical Review E. But despite its appetite for excitable headlines on Twitter-based research, the wider public is still unlikely to compare its value favourably with, say, that of cancer research [4]. In fact it is to be hoped that research into Twitter may be rewarded with insight into the world beyond Twitter. There is no class of interaction occurring within this system that is hidden, so we might hope for analyses less speculative than those we must engage in for more ‘black box’-like systems. But it is possible that what we learn may be reapplied to such systems, enhancing our understanding of social behaviours in general or, further, Complexity as a whole. It is an insight of that field that systems apparently dissimilar on the small scale may have deep connections on the large scale. (And since a cancer also qualifies as a complex system, perhaps there really is the prospect of some benefit to medical research!) Less far off, we can see many reasons why it is important to understand how people interact and come to consensus. International development and environmental protection are examples of fields in which it is often found that top-down policy dicta fail to have the desired effect.

∗ My thanks to supervisor Duncan Robertson, also to Profs. Sandra Chapman and Robin Ball for their insightful suggestions.
There might be some awareness of a failure to engage local populations or to plan correctly for the dissemination of knowledge, but it is generally unclear how to ameliorate the situation. Those charged with achieving such a change of culture would be aided by a sound grasp of just how change is seeded and established in social systems. Financial market regulators suffer from related problems. Recent times in particular have seen them appear impotent in the face of market forces they can only pretend to understand. What we can say is that these forces arise from the interplay of the many parts of a complex social system, even though lines of interaction may be murky. A term often used in accounts of the credit crunch of 2007-2008 is ‘contagion’, referring to the way in which cashflow disruption, or even just perception of a decline in creditworthiness, was seen to spread like an infection from corporate entity to corporate entity. We shall see that such epidemiological metaphors are also highly applicable to activity in the Twittersphere. For some there is much of interest in Twitter for its own sake. Going back a long way, news organisations have sought dialogue with their audience, traditionally through the letters page, more recently through the comment box. Twitter opens up a new frontier. But in their interest we might also see a defensive posture. After all Twitter is potentially another prong of the Internet’s feared assault on traditional media. And naturally marketing organisations are interested in what looks like a new way of reaching a large (and rapidly increasing) audience. The ‘viral’ campaign is something of a Holy Grail to modern marketers— advertisement so beautifully designed that it spreads of its own accord. Wieden + Kennedy has enjoyed success this summer with its much lauded campaign for Old Spice. There is even evidence that it actually increased sales of the smelly stuff.
For any company or individual with a brand to build or protect, this is serious business. Relevant questions include: does good news travel faster than bad news? and what are the qualities of a viral tweet? Most high-minded of all, we may speculate that an understanding of how the Twittersphere thinks may help us to answer the deepest questions about the thought processes of society itself. How do ideas spread? How do they resonate or cohere or find beauty with what has gone before? Is this a process without end? Is it meaningful to talk about progress in thought? Philosophers have traditionally been more interested in the psychology of the individual when it comes to understanding how we perceive the world. But the lens through which we look is arguably as coloured by the collective experience. Philosophy itself may be regarded as a set of ideas that filter down to influence everyone’s perception. We might describe the focus of this project as the spread of ‘memes’ (a buzzword used to describe ideas that infect society rather in the manner of viruses [5]). For as long as humans have thought, they have wondered how thoughts arise. The individual mind may remain mysterious, but we may now be able to make great progress in understanding the collective mind. Exploration of this new data source is also attractive because, as new territory, it remains largely unmapped. We aim to approach the subject in as ‘top-down’ and as model-free a manner as possible. By beginning with objective measurements and statistics, we hope finally to come to a qualitative understanding of what descriptions might fit the system.

II. THE DATA SOURCE

Twitter is a web- and mobile phone-based service that permits users to broadcast tweets of up to 140 characters to a band of self-selected followers. The network of followers and followed forms a directed graph. Statistics relating to the structure of the graph have been presented elsewhere [6].
The relationship between numbers of followers and followed at a node is generally asymmetric. The follower mechanism is probably the major channel through which users find tweets that are of interest to them. But it is not the only one—searches and ‘trending topics’ also play a part. Retweets and hashtags are features of Twitter helpful to us because they allow us to proceed without attempting full textual analysis. Interestingly both innovations came from the user community itself, and were only introduced as formally supported features at a later date. Hashtags, being simply a word or combination of words preceded by the # symbol, are used as a means of labelling individual tweets according to a theme. Hashtags can potentially develop in popularity over a lengthy period of time. Constantly revitalised by new contributions, like a mutating cold virus, there need be no upper limit on the number of times a hashtag spreads and reinfects. We might study quantities that have direct parallels to those studied in models of infectious diseases, for example the critical threshold of infectiousness at which the disease never dies out. But we focus on retweets, where users simply copy tweets they like to their own followers, including the code RT and an attribution. Since a genuine retweet (one that leaves its content unmodified) will be made once at most by each user, there is an upper limit on the extent of its spread (though not necessarily an upper limit to the time over which it occurs). The timescale over which retweet waves play out is smaller than that of hashtags. For this reason they are easier to work with, given our data sets of limited duration, because we may be more confident that we have observed them in (close to) their entirety. Although we cannot identify everyone with whom an idea resonates, we can identify the subset with whom it resonates to such an extent that they are moved to retweet it.
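Because a retweet carries the code RT and an attribution, waves can in principle be recovered from a recorded stream without full textual analysis. A minimal sketch of such grouping (the tuple format, regular expression and function names are our own illustrative choices, not the pipeline actually used in this study):

```python
import re
from collections import defaultdict

# Matches the informal 'RT @user: text' convention described above.
RT_PATTERN = re.compile(r"RT @(\w+):?\s*(.*)")

def group_into_waves(tweets):
    """Group recorded tweets into retweet waves.

    `tweets` is a list of (time, user, text) tuples in arrival order.
    A wave is keyed by the attributed originator and the retweeted
    text; each wave collects the timestamps of its retweets.
    """
    waves = defaultdict(list)
    for time, user, text in tweets:
        match = RT_PATTERN.match(text)
        if match:
            origin, content = match.groups()
            waves[(origin, content.strip())].append(time)
        else:
            # An original tweet seeds a (possibly size-zero) wave.
            waves[(user, text.strip())]
    return dict(waves)

tweets = [
    (0.0, "alice", "The count is in"),
    (1.5, "bob",   "RT @alice: The count is in"),
    (2.0, "carol", "RT @alice: The count is in"),
]
waves = group_into_waves(tweets)
# wave size = number of retweets recorded
wave_sizes = {k: len(v) for k, v in waves.items()}
```

Real traffic is messier (modified retweets, nested attributions), so a production version would need a far more forgiving parser.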
To model the process of retweeting on a fine scale would be hard. Perhaps we might simplify it thus: Each user checks Twitter at intervals assumed to be random according to some time-dependent rate, a function that is individual to the user but that might sensibly be linked to timezone. Each time they check, they will be faced with a list of tweets made by those they follow. Unless they check quite religiously, there is a good chance that more than a page’s worth of tweets will have been made since they last checked. It is likely that those made longer ago will never be read; certainly the chance of their being retweeted must decline with time. But if something is retweeted again by another of those followed, it will rise once more to the top. Thus the more retweets, the more likely it is to be read by this user, who may in turn be tempted to retweet it again. Twitter also provides the service to mobile phones, possibly a different dynamic again. Also recall that people may retweet tweets that they find through searches or by following the trending topics of the moment. So an accurate microscopic model looks difficult to construct. We have to hope that we will be able to find things to measure that are not overly affected by the simplifications we shall be forced to make.

TABLE I. Sets of data in this study, the collection periods and ‘track keywords’ used to filter the stream.

LABOUR: 2010-05-06 15:03 to 2010-05-07 21:30 (98 875 tweets). Keywords: ‘Labour’.
ELECTION: 2010-05-07 21:30 to 2010-05-11 20:53 (405 255 tweets). Keywords: ‘Labour’, ‘Conservative’, ‘Conservatives’, ‘Tory’, ‘Tories’, ‘Liberal’, ‘Lib’, ‘Cameron’, ‘Clegg’, ‘Cleggy’, ‘Weggy’.
COALITION: 2010-05-12 16:41 to 2010-05-16 08:07 (101 512 tweets). Keywords: ‘Con/Dem’, ‘Con-Lib’, ‘Lib-Con’, ‘Cleggeron’, ‘Cameron/Clegg’, ‘Conservatives’, ‘Lib’, ‘Liberal’, ‘LibDem’, ‘LibDems’, ‘Cameron’, ‘Clegg’, ‘Cleggy’, ‘Weggy’.
BUDGET: 2010-06-22 12:59 to 2010-06-28 22:46 (149 569 tweets). Keywords: ‘Budget’, ‘Osborne’, ‘Deficit’, ‘Debt’, ‘Tax’, ‘Cuts’, ‘Pensions’, ‘Benefits’, ‘Employment’, ‘Unemployment’, ‘NHS’, ‘Defence’, ‘Banks’, ‘Green’, ‘Coalition’, ‘Aid’, ‘EU’, ‘Universities’, ‘Schools’, ‘Progressive’.
BB11: 2010-07-28 19:09 to 2010-08-03 23:24 (105 706 tweets). Keywords: ‘#BB11’.
EDFRINGE: 2010-08-09 19:48 to 2010-08-25 03:09 (40 000 tweets). Keywords: ‘#EdFest’, ‘#EdFringe’, ‘Edinburgh Festival’, ‘Edinburgh Fringe’, ‘Ed Fringe’.

A. Data Acquisition

Twitter is primarily a web-based service. As such its output is consumed principally by users browsing HTML web pages. It would be arduous (if possible at all) to extract the information we need directly from the HTML. So it is fortunate that Twitter makes available APIs that we can use. These are mostly aimed at third-party applications and web sites. There are three separate APIs: the REST API, the Search API and the Streaming API. All have limits on the amount of data one may draw out, although it is possible to apply for enhanced access (whitelisting). For this investigation I chose to use the Streaming API, estimating that it would yield the greatest quantity of data. It is necessary to filter the output using ‘track keywords’ in order to limit the amount coming in to a level that Twitter will allow. Some care is required in tuning these so as to collect as much as possible without triggering a ‘track limitation’ notice. Despite our attention, it was indeed such a notice that brought an end to four out of six of our collections. I built our client using open-source components: a Java library, Twitter4J, whose methods I call from the Java-friendly Lisp Clojure [7]. Please contact me for a copy of the code used. Clojure’s read-eval-print-loop (REPL) was invaluable in providing an interactive approach to data collection and exploration. MATLAB was used for further processing and exploration and for the drawing of the various charts contained herein. Twitter is something of a moving target.
It redesigned its web interface during the course of these investigations and also made changes to the Streaming API we were using. It recently donated its archive to the US Library of Congress and also made its historical activity available for search via Google. At the time of writing, there seems still to be no way to take advantage of the Library’s acquisition. And although Google does provide APIs for several of its search functions, unfortunately the Twitter results seem only to be available via browser-rendered HTML.

III. ANALYSIS AND RESULTS

Let us call a ‘wave’ the sequence (or avalanche) of retweets that proceeds from a single tweet. The initial representation of a wave to come from the data is the set of times at which tweet and retweets occur. We may bin the data on the time axis to recover a rate function. Alternatively waiting time between tweets may be regarded as a function of tweet number (sometimes the more natural, and parsimonious, approach). Either way, a wave can be said to have some characteristic size, shape and timescale. To establish relationships in probability between these three is our aim. We should like this to give us insight into the type of model appropriate to this domain. Even leaving such hopes to one side, it would be an achievement to learn something of the circumstances under which a tweet is likely to garner most exposure. Our distributions need not be independent of time or track keyword, but we ought still to be able to establish relations.

A. Distribution of Wave Size

In physical models of sandpiles, ferromagnets and the like, it is common to study avalanches. It is generally assumed that there is a separation of timescales between the one at which the system is driven and the one at which the avalanches play out. Effectively avalanches are supposed to occur instantaneously. A wave of retweets is somewhat analogous to an avalanche.
The size of the wave we define as the number of individuals who retweet, just as the size of an avalanche is the number of sites that change state. In practice retweet waves are not instantaneous. In any set of measurements they will necessarily be truncated by the moment at which the measurements were made. A consequence is that we can never be sure that a wave has been recorded in its entirety. (And it is not clear that suitable models would admit that a wave ever comes to a complete stop.) Nevertheless we can and do approximate wave size by the number of retweets we count within our sample. If the resultant error is significant, we ought to be able to observe its effect in the difference between the distribution of waves recorded at the beginning of the sample and the distribution of waves recorded at the end. Where S is the random variable that stands for the size of some wave, let

f(s) = Pr(S = s),
F(s) = Pr(S ≥ s) = Σ_{r=s}^{∞} f(r).

It is instructive to plot on a log-log scale the estimate of F that is the rank-ordered plot of recorded data (Figs. 1 to 6). There is in those figures the suggestion of a straight line. That would imply a power law, a symmetry across scales. Claims of power laws have been made with enthusiasm in many spheres, and frequently in social phenomena, as their existence can be evidence for certain underlying mechanisms [8]. However it is difficult to demonstrate empirically the existence of a power law. A methodical reworking of previous claims concludes that many are unfounded [9]. With the limited number of data that we currently have, we would need to appeal to a priori reasons to conclude a power law. In a later section, III A 2, we attempt to establish whether such reasons might exist.

1. Time Dependence

Each one of Figs. 1 to 6 depicts the distribution derived from the set of tweets made over the whole of the (respective) collection period.
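The rank-ordered estimate of F underlying such plots amounts to sorting the observed wave sizes in decreasing order and pairing the k-th largest with the rank fraction k/n. A small self-contained sketch (names and data illustrative):

```python
def empirical_ccdf(sizes):
    """Rank-ordered estimate of F(s) = Pr(S >= s).

    Sort observed wave sizes in decreasing order; the k-th largest
    size is paired with rank fraction k/n.  Plotted on log-log axes
    this gives the rank-ordered plot of the text.
    """
    n = len(sizes)
    ordered = sorted(sizes, reverse=True)
    return [(s, (k + 1) / n) for k, s in enumerate(ordered)]

points = empirical_ccdf([1, 1, 1, 2, 2, 3, 7])
# e.g. matplotlib's loglog() on these pairs would reveal tail behaviour
```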
The question that these plots address is, “What is the probability that a wave picked at random from the collection period has size s?” As such, each of those distributions may be regarded as a weighted average of distributions for each of a number of equal-size time sections that partition the collection period. (Alternatively it might be regarded as an unweighted average of distributions relating to each of a number of equal-size measure sections that partition the total collection of tweets.) It is to be assumed that there exists a size for time (or measure) sections sufficiently small that within it the process under investigation may be regarded as stationary in important senses. System size, for example, is a quantity that varies greatly throughout the day as people log into and out of Twitter. But for a sufficiently small period of time we can surely regard it as constant. If we are to partition the data sets in time, we face again the practical issue that waves do not occur as instantaneously as we would like to pretend. It is not entirely clear how to assign waves to time sections when many sprawl across section boundaries. In a world of instantaneous waves it must be the case that r_τ(s), the rate at which waves of size s occur during time section τ (assumed constant over that section), obeys

s · r_τ(s) = n_τ(s) / |τ|   ∀ s ∈ ℕ,

where n_τ(s) is the number of tweets counted within the interval τ that belong to a wave of size s. By defining r_τ(s) in terms of n_τ(s) according to this equation, we maintain continuity as we move to the real-world situation of temporally extended waves. It amounts to the assignment of fractions of a wave to different sections. A simpler approach might be to assign a wave in its entirety to the section in which its first tweet appears. But for waves of longer duration in particular, this hasn’t always felt quite right. Figs.
7 through 12 show total rate functions for waves and tweets respectively,

R_W = Σ_s r(s),    R_T = Σ_s s · r(s),

where r(s) has been estimated for each time section as described above. As one might expect, these measures vary substantially over time, most obviously on a diurnal basis.

FIG. 1. LABOUR: distribution of wave size S. (a) CDF: F(s) = Pr(S ≥ s); (b) PDF: f(s) = −F′(s).
FIG. 2. ELECTION: distribution of wave size S. (a) CDF; (b) PDF.
FIG. 3. COALITION: distribution of wave size S. (a) CDF; (b) PDF.
FIG. 4. BUDGET: distribution of wave size S. (a) CDF; (b) PDF.
FIG. 5. BB11: distribution of wave size S. (a) CDF; (b) PDF.
FIG. 6. EDFRINGE: distribution of wave size S. (a) CDF; (b) PDF.
FIG. 7. LABOUR: total rates R_W = Σ_s r(s) and R_T = Σ_s s r(s) over the collection period.
FIG. 10. BUDGET: total rates R_W and R_T over the collection period.
FIG. 8. ELECTION: total rates R_W and R_T over the collection period.
FIG. 11. BB11: total rates R_W and R_T over the collection period.
FIG. 9. COALITION: total rates R_W and R_T over the collection period.
FIG. 12. EDFRINGE: total rates R_W and R_T over the collection period.
FIG. 13. ELECTION: likelihood of exponent α over time (80% confidence).
FIG. 16. BB11: likelihood of exponent α over time (80% confidence).
FIG. 14. COALITION: likelihood of exponent α over time (80% confidence).
FIG. 17. EDFRINGE: likelihood of exponent α over time (80% confidence).
FIG. 15. BUDGET: likelihood of exponent α over time (80% confidence).

Although the rates are highly time-dependent, that does not imply that the normalised PDF,

f_τ(s) = r_τ(s) / Σ_{s′} r_τ(s′),

need also vary with time. However constant f_τ(s) would imply parallel lines (to within statistical error) for R_T and R_W in Figs. 7 to 12. At a glance, that is not obviously the case. We may examine more closely the question of time dependence of f_τ(s) using a statistic that seems natural given the semblance of a power law, namely the likely exponent α_τ in the assumed relation

f_τ(s) = s^{−α_τ} / ζ(α_τ),

which gives us that for the data set D_τ associated with time section τ,

Pr(D_τ | α) = Π_{j=1}^{|D_τ|} (1/ζ(α)) e^{−α log s_j} = ζ(α)^{−|D_τ|} e^{−α Σ_j log s_j}.

We are told that the exponent of a power law is best estimated using the Maximum Likelihood (ML) procedure [9].
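A hedged numerical sketch of such an ML fit: a grid search over α, with ζ(α) approximated by a truncated sum (in practice a library implementation would be preferable), and a normalisation that subtracts the maximum log-likelihood before exponentiating so as to avoid floating-point underflow. All names and the toy data are our own assumptions:

```python
import math

def log_zeta(alpha, terms=5000):
    """log of the Riemann zeta function via a truncated sum.

    A crude approximation, adequate for alpha comfortably above 1
    as used here.
    """
    return math.log(sum(k ** -alpha for k in range(1, terms + 1)))

def log_likelihood(sizes, alpha):
    """L(alpha) = -alpha * sum_j log s_j - |D| * log zeta(alpha)."""
    return (-alpha * sum(math.log(s) for s in sizes)
            - len(sizes) * log_zeta(alpha))

def posterior(sizes, alphas):
    """Normalised distribution over a grid of alpha values.

    Subtracting the maximum log-likelihood (the m_tau trick of the
    text) keeps the exponentials within floating-point range.
    """
    L = [log_likelihood(sizes, a) for a in alphas]
    m = max(L)
    w = [math.exp(x - m) for x in L]
    total = sum(w)
    return [x / total for x in w]

alphas = [1.5 + 0.02 * i for i in range(101)]        # grid on [1.5, 3.5]
sizes = ([1] * 750 + [2] * 150 + [3] * 50 + [5] * 30
         + [10] * 15 + [30] * 5)                     # toy wave sizes
p = posterior(sizes, alphas)
alpha_ml = alphas[max(range(len(p)), key=p.__getitem__)]
```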
It is simple to perform this task numerically, basing results on the log-likelihood function,

L_τ(α) = log Pr(D_τ | α) = −α Σ_{s=1}^{∞} n_τ(s) log s − log ζ(α) Σ_{s=1}^{∞} n_τ(s).

This function need only be maximised to recover the ML estimate. The red crosses in Figs. 13 through 17 correspond to the modes of L_τ(α). But by normalising, we can use the log-likelihood function to construct a distribution over α,

p(α | D_τ) = e^{L_τ(α)} / Σ_{α′} e^{L_τ(α′)}.

In general it will not be possible to perform this calculation directly on the computer (at least in MATLAB) because of machine limitations to floating point numbers. However the issue may be sidestepped by transforming our equation thus,

p(α | D_τ) = e^{L_τ(α) − m_τ} / Σ_{α′} e^{L_τ(α′) − m_τ},   where   m_τ = max_α L_τ(α).

From the normalised distribution p(α | D_τ) we easily derive median and error bars, as depicted in blue in Figs. 13 through 17. It does seem clear that the variations exceed what one can expect through error alone, and therefore that f_τ(s) may not be considered constant in time. But we might hope still to be able to model it. One approach is to regard the variation in our estimator as the consequence of finite-size effects that distort the shape of f_τ(s) (now strictly a power law only in the infinite-size limit) as the system size changes over time. There is the hint of a ‘roll-off’ in the tails of our distributions (Figs. 1 to 6). This is a characteristic of finite-size scaling, illustrated by the simple mean-field model that we introduce next.

2. Illustrative Model of Meme Spread

Inspired by the parallels between retweet waves and mean-field models of avalanches that manifest as Barkhausen noise in ferromagnets [10], we introduce a simple theoretical model with which we can illustrate possible approaches to real data. At best it can be said to approximate only one small aspect of the process. So although we begin by specifying an ‘external’ field H, we are not requiring that it be external to the system, merely noting that its source is external to our model. H refers to a pressure exerted on the community in a particular direction of ‘thought-space’. Its effect is felt at the point where it reaches a level such that some individual, the one most predisposed to do so, is moved to articulate it. That event serves to strengthen the field as exerted on the rest of the community. Depending on their predisposition to do so, measured by a random field f_i associated with each, others may also respond by expressing the same meme. This in turn will increase the field yet further.

Specifically (and with only limited justification) we imagine a subpopulation of L susceptible individuals, having their predisposition f_i distributed uniformly across an interval of length L,

f_i ∼ U(H, H + L),

and stipulate that the meme is repeated by any individual whose predisposition satisfies

f_i ≤ H + Σ_j s_j,

with s_i denoting the state of an individual i, a switch from 0 to 1 if and when that individual responds. The chain of repetition may be regarded as an avalanche. The distribution of avalanche size is plotted in Fig. 18. The distributions are self-similar and we recognise finite-size scaling [11], expressed thus in terms of L,

D(S, L) = S^{−3/2} D(S L^{−1}).

Therefore we may plot S^{3/2} D against S L^{−1} to see the curve collapse (Fig. 19).

FIG. 18. The PDF of avalanches in the illustrative model for systems of three different sizes L (L = 998, 9998, 99998). It is clear that in the infinite-size limit we would see a strict power law with exponent α = 3/2.

FIG. 19.
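The illustrative model can be simulated directly. In the sketch below we take the seed articulation to contribute one unit of field, with each response adding another; that reading of the dynamics, and all names, are our own assumptions:

```python
import random

def avalanche_size(L, rng):
    """One avalanche of the illustrative mean-field model.

    Thresholds f_i - H are uniform on (0, L); the seed contributes one
    unit of field and every response adds another.  With thresholds
    sorted, individual k (1-indexed) responds iff its threshold is at
    most k, given that the k-1 smaller thresholds have already fired.
    """
    thresholds = sorted(rng.uniform(0, L) for _ in range(L))
    s = 0
    while s < L and thresholds[s] <= s + 1:
        s += 1
    return s

rng = random.Random(1)
sizes = [avalanche_size(1000, rng) for _ in range(2000)]
# In the infinite-L limit the size PDF approaches s^(-3/2); finite L
# truncates the tail, producing the 'roll-off' of Fig. 18.  Plotting
# s**1.5 * pdf(s) against s / L would exhibit the collapse of Fig. 19.
```

Note that the density of one threshold per unit of field makes the process critical (each response triggers one further response on average), which is what produces the 3/2 exponent.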
The PDFs of systems of different size are self-similar, as can be seen from this curve collapse.

Fig. 20 shows a slightly different situation. Each of the lines relates to a distribution that is effectively an average over a period during which the system size randomly varies between its maximum and a tenth of that value. This has the effect of smoothing the ‘roll-off’ visible in the CDF, somewhat reminiscent of what we see in the real Twitter data.

FIG. 20. The CDF of avalanches in the illustrative model for systems of different maximum size. Each distribution averages over a time period during which the system size varies between its maximum value and a tenth of that.

3. Finite Size Scaling in the Data

The ELECTION data set, the largest and the busiest we have, is the natural choice to work with. Our hypothesis is that as the size of the system varies over the period, we shall see finite-size scaling in the distribution of wave size. In that case we should be able to perform a curve collapse with real data just as we did in Fig. 19 with simulated data. In order to get there, we shall need to divide the period into sections sufficiently small that we can assume system parameters such as size to be constant (or substantially reduced in variability). Unfortunately we have no good measure for system size, especially as time sections become short. It is impossible to know who is logged in to Twitter unless they engage in tweeting. So we just count the number of unique users tweeting during a time section τ and call that the system size L. This quantity is perhaps not a particularly valuable one; the number of tweets itself might have served just as well as a measure of system size. Fig. 21 shows the situation where the period has been split into ten equal-size time sections. Fig. 22 shows how data collapse might be attempted. Both plots are messy and jagged even with as few as ten sections.
It would be unduly optimistic to hope to perform a meaningful curve collapse. But our aim in partitioning the time period is to end up with time sections in which system parameters (such as system size) vary to only a fraction of the degree to which they vary over the whole period. Fig. 13 suggests that at least fifty time sections would be required to achieve this. If it is hard to imagine success with only ten sections, then it must be virtually impossible for fifty. Smoothing the CDFs is one approach that might make them easier to work with. Instead we turn to the PDFs and, with a similar intention, bin data points. In some ways, the PDF is a more suitable quantity to deal with. It is perhaps more intuitive, and is also free of the spurious correlations that one sees in CDF plots. Fig. 23 suggests that despite the binning, it will still be hard to make much of the data, even with as few as ten sections.

FIG. 21. The CDFs of ELECTION wave sizes under a time partition of ten equal-size sections.
FIG. 23. The PDFs of ELECTION wave sizes under a time partition of ten equal-size sections.
FIG. 22. The same collection of CDFs scaled in such a way as might have been hoped to induce curve collapse.
FIG. 24. The PDFs of ELECTION wave sizes under a time partition of 200 sections, subsequently aggregated according to L.
Our final attempt, shown in Fig. 24, is more promising. Here we have partitioned the time period into many (200) sections, so that it may be assumed that system parameters within a section are almost constant. But we have then aggregated the sections according to each one’s value of L (for which ‘system size’ becomes an increasingly inaccurate description as the sections become small).

B. Individual Waves in Time

In the previous section III A we clung to the convenient fiction that retweet waves occur instantaneously. For some purposes, where we have clear separation of timescales, it may be reasonable to make this simplification. But there does exist some timescale (however short) over which a wave develops and decays. Behaviour at this scale may yield further clues as to the nature of the process we investigate. Naturally the statistics for individual waves are subject to a level of noise. By basing our analysis on the whole ensemble of wave observations, we hope to mitigate the effects of randomness. But it can be difficult to know how to ‘average’ over waves. To do so we require some minimal model relating one wave to another. We make a start in parametrising that model in the next part III B 1. Examination by eye suggests that a large wave might reasonably be viewed as a superposition of a number of smaller waves, each triggered by a retweet made by one of a relatively small number of influential individuals.

1. Relationship between Wave Size and Timescale

It might be supposed that, on average, the time envelope associated with a large wave (one with more participants) will have greater extent than one associated with a small wave. That would be the natural consequence of the time taken for the ‘contagion’ to spread from node to node. In order to test this hypothesis, we need to take a measurement of each wave that encapsulates its inherent timescale.
We should like to average these measurements over waves of the same or similar sizes, then to compare those averages between the sizes. But many measurements will depend on wave size in a trivial way, and will result in the suggestion of a spurious relation. To pin this issue down, we posit that for each wave there exists an underlying distribution in time (its envelope) according to which its tweets are distributed. Then the parametrisation of one such distribution can be compared with that of another, without the precise number of tweets manifested having a direct effect. Of course it is not possible directly to observe the parameters of these underlying distributions, since we have only a sample from each: the observed tweets. The best that we can hope for is a measurement that constitutes an unbiased estimator of an underlying parameter; that is, one whose expectation is equal to the underlying value. The Law of Total Expectation tells us that
\[
\mathrm{E}_{\hat V}\bigl[\hat V\bigr] = \mathrm{E}_{V}\Bigl[\mathrm{E}_{\hat V|V}\bigl[\hat V \,|\, V\bigr]\Bigr]
\]
where $\hat V$ is our estimator for parameter $V$ of the underlying distribution. An unbiased estimator will further obey
\[
\mathrm{E}_{\hat V|V}\bigl[\hat V \,|\, V\bigr] = V
\]
so that
\[
\mathrm{E}_{\hat V}\bigl[\hat V\bigr] = \mathrm{E}_{V}[V] = \mu_V .
\]
We have such an estimator in
\[
\hat V = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar x)^2 .
\]
The expectation of $\hat V$ is the variance $V$ (in units of time squared) of the underlying distribution. Importantly, that statement is independent of any assumption about that distribution's shape. Then for a collection $\{\hat v_k \mid k = 1, \dots, m\}$ of such estimates, the Central Limit Theorem applies for large $m$:
\[
\frac{\sum_{k=1}^{m} \hat v_k - m\mu_V}{\sigma_{\hat V}\sqrt{m}} \sim N(0,1) ,
\]
where $\mu_V$ is the mean variance of the underlying distributions, a measure of the average wave timescale. Thus we are led to the following expression for the likelihood of that mean:
\[
\mu_V \sim N\!\left(\frac{1}{m}\sum_{k=1}^{m} \hat v_k ,\; \frac{1}{m}\sigma_{\hat V}^2\right) .
\]
So by replacing $\sigma_{\hat V}^2$ with the estimator $\hat\sigma_{\hat V}^2$, calculated from the sample, we may put error bars on our estimates of $\mu_V$.
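The estimator and its CLT-based error bars might be computed along the following lines. This is a sketch with hypothetical function names; z ≈ 1.28 corresponds to 80% intervals of the kind used in the figures.

```python
import numpy as np

def wave_variance(times):
    """Unbiased estimator V-hat of the variance (in time squared) of a
    wave's underlying envelope, from the observed tweet times of that wave."""
    t = np.asarray(times, dtype=float)
    return t.var(ddof=1)  # 1/(n-1) * sum of squared deviations

def mean_variance_ci(waves, z=1.28):
    """Estimate mu_V, the mean underlying variance over an ensemble of
    waves, with a CLT-based interval (z = 1.28 gives roughly 80%)."""
    v = np.array([wave_variance(w) for w in waves])
    m = len(v)
    mu = v.mean()
    se = v.std(ddof=1) / np.sqrt(m)  # estimate of sigma_V-hat / sqrt(m)
    return mu, (mu - z * se, mu + z * se)
```

Binning waves by size and applying `mean_variance_ci` to each bin gives one point with error bars per bin, as in the plots of the next part.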
We take this approach with the Twitter data, binned according to wave size, and are rewarded with Figs. 25 to 29. It is important to note that a small value of $m$ undermines both the validity of our use of the Central Limit Theorem and the accuracy of our estimator $\hat\sigma_{\hat V}^2$. Some of the larger error bars in the figures indicate exactly that, and should therefore be taken with a pinch of salt. The whole of the EDFRINGE plot, for example, is dubious. However, most of the other sets of results look reasonable. Remember that with 80% confidence intervals, we expect the true value of $\mu_V$ to lie outside the error bars on about one in five occasions. The dependence of $\mu_V$ on wave size is not at all dramatic, suggesting that the effect of delay as the contagion spreads from node to node is not as important as one might have thought. We also see substantial variation in the magnitude of $\mu_V$ according to the data set. Of course the sets are not really comparable, being taken from different populations at different times. Again, a partitioning of the period into smaller time sections might permit comparisons.

FIG. 25. ELECTION: dependence of wave timescale on wave size (80% likelihood).

FIG. 26. COALITION: dependence of wave timescale on wave size (80% likelihood).

FIG. 27. BUDGET: dependence of wave timescale on wave size (80% likelihood).

FIG. 28. BB11: dependence of wave timescale on wave size (80% likelihood).

FIG. 29. EDFRINGE: dependence of wave timescale on wave size (80% likelihood).
2. Individual Wave Development and Decay

In the absence of a solid framework by which to perform averages over many waves, we ask what we can learn from individual examples. Figs. 30 and 31 depict the development and decay of the two largest waves of the ELECTION data set:

@Jason at DAVID: Nick Clegg has changed his Facebook relationship status to: "it's complicated."

@stephenfry: LibDem/Tory Rule! All will now be well. Justice prosperity kindness happiness for all! Yay! What could be better? #heavyhandedirony

In these plots we have inverted the usual relationship of measure µ and time t. It is then easy to produce error bars for the mean waiting time τ (by a procedure similar to that employed in section III B 1). The simplicity of the plots, virtually a straight line in long sections (those free of influential retweets), suggests a law for the decay of waves. Suppose we have a period over which the waiting time τ varies as
\[
\log\tau = \alpha + \beta\mu .
\]
Then
\[
t = e^{\alpha}\bigl(1 + e^{\beta} + e^{2\beta} + \dots + e^{(\mu-1)\beta}\bigr)
\]
\[
e^{\beta}\, t = e^{\alpha}\bigl(e^{\beta} + e^{2\beta} + \dots + e^{(\mu-1)\beta} + e^{\mu\beta}\bigr)
\]
so
\[
\bigl(e^{\beta}-1\bigr)\, t = \bigl(e^{\mu\beta}-1\bigr)\, e^{\alpha} .
\]
We have
\[
\mu = \frac{1}{\beta}\,\log\!\bigl(1 + e^{-\alpha}\bigl(e^{\beta}-1\bigr)\, t\bigr)
\]
\[
\frac{d\mu}{dt} = \frac{e^{-\alpha}\bigl(e^{\beta}-1\bigr)}{\beta\,\bigl(1 + e^{-\alpha}\bigl(e^{\beta}-1\bigr)\, t\bigr)} = \beta^{-1}\left(\frac{e^{\alpha}}{e^{\beta}-1} + t\right)^{-1} .
\]

FIG. 30. Decay of the largest wave of ELECTION, "It's Complicated" (90% likelihood): log waiting time against retweet number.

FIG. 31. Decay of the second-largest wave of ELECTION, #heavyhandedirony (90% likelihood): log waiting time against retweet number.

The largest single wave contained within the ELECTION data set is made up of 584 retweets, each saying,

RT @Jason at DAVID: Nick Clegg has changed his Facebook relationship status to: "it's complicated."

But another 970 tweets contain the character strings 'Nick Clegg', 'Facebook', 'relationship status' and 'complicated'. All express the same joke.
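The straight-line sections of such decay plots suggest a simple fitting procedure. The sketch below is illustrative (function names are hypothetical): a least-squares fit of log τ = α + βµ over one straight section, with µ starting at zero so that the geometric sum derived above gives the cumulative time exactly.

```python
import numpy as np

def fit_decay(waiting_times):
    """Least-squares fit of log(tau) = alpha + beta * mu over a straight
    section of a wave's decay, where mu is the retweet number (from 0)
    and tau the waiting time before each retweet."""
    tau = np.asarray(waiting_times, dtype=float)
    mu = np.arange(len(tau))
    beta, alpha = np.polyfit(mu, np.log(tau), 1)
    return alpha, beta

def elapsed_time(mu, alpha, beta):
    """Cumulative time after mu retweets implied by the fitted law:
    t = e^alpha (e^{mu beta} - 1) / (e^beta - 1)."""
    return np.exp(alpha) * (np.exp(mu * beta) - 1.0) / (np.exp(beta) - 1.0)
```

For data generated exactly by the law, `elapsed_time` reproduces the running sum of the waiting times, as the geometric-series identity requires.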
Many of those 970 do attribute Jason at DAVID, or retain exactly the same phrasing, suggesting that they derive from the same source but somehow became separated from the main wave. (There are other ways of retweeting that the system will not recognise, copy-pasting being the simplest.) However, it does look as though some of them may have originated independently. Of course it is hard to be sure. But I am prepared to believe that they came about independently, given that the joke was, perhaps, a relatively obvious one to make at that time. It is sometimes the case in the sciences, in parallel, that different researchers will come up with the same ideas independently. (Not so much the great and unexpected ideas perhaps, but those that are a natural step from what has gone before.) Perhaps there was, in some sense, a certain inevitability to this tweet given the time and the situation. I like to think that this lends credibility to the idea of a 'pressure' in a certain direction in 'thought-space' (as in our toy model) that somebody somewhere is bound to articulate.

Where might a model such as that lead us? For every one of the large (but finite) number of tweets that it is possible to make, each individual has an associated field $f_i$, representing its disposition to tweet or retweet. Thus we locate each individual at a point in a space whose axes are the possible tweets. There are bound to be strong correlations between an individual's coordinates on the various axes. With enough data we might even hope to apply some algorithm such as PCA to reduce the effective dimensionality to a manageable number. We would have uncovered a workable definition of 'thought-space'! It may not be assumed that an individual will stay still at any one location in this space. We might then ask how its movement relates to activity within the Twittersphere.
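A PCA-style reduction of such a disposition matrix could be sketched as follows. Everything here is hypothetical: the matrix F, the function name, and the choice of an SVD are illustrative, since no such data set was assembled in this study.

```python
import numpy as np

def thought_space(F, k=2):
    """Project a (users x possible-tweets) disposition matrix F onto its
    top k principal components -- a sketch of the proposed 'thought-space'.
    Standard PCA: centre each tweet axis, then use the right singular
    vectors of the centred matrix as the new axes."""
    X = F - F.mean(axis=0)              # centre each tweet axis
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T                 # coordinates on the top k axes
```

Each user is thereby assigned k coordinates in a reduced space; how those coordinates drift over time could then be compared with retweet activity.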
But it is naïve to treat Twitter as a system in isolation; perhaps it might be better viewed as a visible subset of the scaled-up system that is society and society's thought. In our data sets the information already exists that would permit us to identify individuals within waves, and to compare and contrast their activity across waves. To make use of that is likely to be an important step forward. It is also possible to download the follower graph itself, and to use this would vastly improve our knowledge of the paths taken by waves through the network. In this investigation we have made no consideration of the impact of network structure, although this might have been thought one of the more obvious lines of enquiry. But we have seen that our evidence for a timescale increasing with wave size verges on the insignificant. And other work has shown that in general the paths are short [6]. We should try to measure network effects and somehow establish their importance. The simplest model is going to remain the fully connected network, and it is not clear that network effects are the biggest threat to this model's credibility. Another obvious path might be textual analysis, although this is potentially a complicated one to take. It is always possible, though, that a model of natural language will benefit as much from its link with a model of Twitter as vice versa. We suffered in this project from a lack of data. I had believed that I had collected quite a lot from the Twitter feed, but ultimately I found myself attempting analyses that wanted more. (Perhaps this is always the way!) Clever use of averaging and aggregating (Fig. 24) is important to make best use of the quantity we have, be it averaging over different waves at the same point in their decay, or averaging over times at which the system has roughly the same size (e.g. summing over three o'clocks in the morning). But more data would also be very helpful.
IV. FURTHER WORK

An immediate target for future work might be to establish a line for more data, be that from Google, from the Library of Congress or from Twitter itself. Associated work by Sandra Chapman, Duncan Robertson and Ed Bullmore has investigated shocks in neurological and financial systems. There remains a great deal of scope for exploring parallels between those types of system and online social networks. Those investigations have apparently moved in an information-theoretic direction, and it would be very interesting to see whether such an approach might be applicable to the study of Twitter.

[1] "Social networks: The great tipping point test," New Scientist (Jul 26, 2010).
[2] "New York plane crash: Twitter breaks the news, again," Daily Telegraph (Jan 16, 2009).
[3] "Twitter study," Pear Analytics (Aug 12, 2009).
[4] "The mathematical formula for how celebrity gossip spreads on the internet," Daily Mail (Mar 31, 2010).
[5] "Internet memes," BBC Focus (Aug 26, 2010).
[6] H. Kwak, C. Lee, H. Park, and S. Moon, in WWW '10: Proceedings of the 19th International Conference on World Wide Web (ACM, New York, NY, USA, 2010) pp. 591–600, ISBN 978-1-60558-799-8.
[7] Thanks to Yusuke Yamamoto for Twitter4J and to Rich Hickey et al. for Clojure.
[8] D. Sornette, Critical Phenomena in Natural Sciences: Chaos, Fractals, Selforganization and Disorder: Concepts and Tools (Springer Series in Synergetics) (Springer, 2006) ISBN 3540308822.
[9] A. Clauset, C. R. Shalizi, and M. E. J. Newman, SIAM Review 51, 661 (2009).
[10] J. P. Sethna, K. Dahmen, S. Kartha, J. A. Krumhansl, B. W. Roberts, and J. D. Shore, Phys. Rev. Lett. 70, 3347 (1993).
[11] J. P. Sethna, Statistical Mechanics: Entropy, Order Parameters and Complexity (Oxford Master Series in Physics), illustrated ed. (Oxford University Press, USA, 2006) ISBN 0198566778.