Are Raw RSS Feeds Suitable for Broad Issue Scanning? A
Science Concern Case Study1
Mike Thelwall, Rudy Prabowo, Ruth Fairclough School of Computing and IT, University of
Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK.
E-mail: m.thelwall@wlv.ac.uk Tel: +44 1902 321470 Fax: +44 1902 321478
E-mail: rudy.prabowo@wlv.ac.uk Tel: +44 1902 321000 Fax: +44 1902 321478
E-mail: r.fairclough@wlv.ac.uk Tel: +44 1902 321000 Fax: +44 1902 321478
Broad issue scanning is the task of identifying important public debates arising within a given
broad issue; Really Simple Syndication (RSS) feeds are a natural information source for
investigating broad issues. RSS, as originally conceived, is a method for publishing timely and
concise information on the Internet, for example about the main stories in a news site or the
latest postings in a blog. RSS feeds are potentially a non-intrusive source of high quality data
about public opinion: monitoring a large number may allow quantitative methods to extract
information relevant to a given need. In this paper we describe an RSS feed-based co-word
frequency method to identify bursts of discussion relevant to a given broad issue. A case study of
public science concerns is used to demonstrate the method and assess the suitability of raw RSS
feeds for broad issue scanning (i.e. without data cleansing). However, an attempt to identify genuine
science concern debates by investigating the top 1000 'burst' words in the corpus found only two
such debates. The low success rate was mainly caused by a few
pathological feeds that dominated the results and obscured any significant debates. The results
point to the need to develop effective data cleansing procedures for RSS feeds, particularly if
there is not a large quantity of discussion about the broad issue, and a range of potential
techniques is suggested. Finally, the analysis confirmed that the time series information
generated by real-time monitoring of RSS feeds could usefully illustrate the evolution of new
debates relevant to a broad issue.
Introduction
For many types of social science research and market research (Pikas, 2005) there is a need for
information about public opinion or public reaction to certain topics. The Internet may help to address
this need because it is a natural new source of easily available information about the beliefs or
activities of a wide section of society (Burnett & Marshall, 2002; Hine, 2000; Schaap, 2004). The
widespread use of large information sources like those available on the Internet is a modern
phenomenon, part of the ‘informational turn’ of science (Wouters, 2000). One example of a specific
social science application is the study of public science debates (e.g., Corbett & Durfee, 2004), which
is pursued in this paper. In recent years there have been a few influential examples of public science
debates that have reached a level where public opinion has influenced science policy. In the cases of
genetically modified food (Hagendijk, 2004; Klintman, 2002) and stem-cell research (Hellsten &
Leydesdorff, 2004; Leydesdorff & Hellsten, 2005, to appear), highly technical scientific issues have
not been left to the scientists but have been influenced by lay opinion. From a science policy
perspective, extensive debates can damage public confidence, and this confidence is now an
important part of modern knowledge economies (Leydesdorff & Etzkowitz, 2003). Hence there is a
need for early warning systems, perhaps using Internet-based information sources, to give policy
makers time to respond rapidly to new public science concerns.
We introduce the phrase broad issue scanning to describe the task of identifying and tracking
important public debates arising within a given broad issue, such as public science concerns. This
differs from issue analysis (described below) in that the issue is broad enough so that the individual
debates are likely to be separate and to use completely different terminology. It differs from
environmental scanning (Wei & Lee, 2004) in that the focus is on public debates rather than
commercial competitive intelligence, although environmental scanning is sometimes drawn upon in
1 This is a preprint of an article to be published in the Journal of the American Society for Information Science and Technology, © copyright 2005 John Wiley & Sons, Inc. http://www.interscience.wiley.com/
non-commercial contexts, such as government forecasting and trend detection (Cairns, Wright,
Bradfield, van der Heijden, & Burt, 2004; Vander Beken, 2004).
Researchers seeking to extract information from Internet technologies using either qualitative
or quantitative techniques have studied many different types, including e-mail (Koku, Nazer, &
Wellman, 2001), newsgroups (Bar-Ilan, 1997; Caldas, 2003), web sites (Schaap, 2004), and blogs
(Nardi, Schiano, Gumbrecht, & Swartz, 2004). This research is often direct in the sense of seeking to
understand how the new technology is used, but can also be indirect: using online information to
illuminate research questions that are not intrinsically technology-centred. For qualitative research,
blogs are perhaps the ideal data source because they are easy to create and update and cover a wide
variety of types of use (Blood, 2004; Sunstein, 2004), even if their creators are not a fully
representative cross-section of society (Gill, 2004; Matheson, 2004). Blog analysis for financial gain
(Smith, 2005) and for competitive intelligence purposes (Pikas, 2005) has also been discussed. For
large-scale quantitative analysis, however, blogs are not ideal because they are highly repetitive and
complex in structure, making automated analyses difficult. The Really Simple Syndication (RSS)
technology, in contrast, seems ideally suited to automated processing because it allows extensive
metadata and concise site descriptions (see below). Many blogs, news sources and other web sites
maintain RSS feeds that give brief descriptions of updated content on their site. RSS feeds seem set to
become ubiquitous on the Internet after a slow start (Notess, 2002), with all the major browsers
offering support for them by the end of 2005 (BBC, 2005). Currently (July, 2005), most web users
need RSS reader software to subscribe to the feeds of relevant sites, using this as a way of
automatically checking for updates. A monitoring program may poll the feeds hourly, reporting the
titles of each new feed item found. If the user is interested in a title, they can click on it to launch the
full report in the original web site. Automatic monitoring of RSS feeds, for example by word
frequency analysis, could give useful information about public opinion. Some researchers have tried
similar tasks, with a prominent example being IBM's Web Fountain project (Gruhl, Guha, Liben-Nowell, & Tomkins, 2004). Nevertheless, no previous research has assessed the suitability of RSS
feeds for the identification and tracking over time of discussions related to a broad issue or concern.
As discussed below, previous issue analysis research methods have been essentially retrospective,
whereas RSS feeds offer the potential, perhaps for the first time, for a real-time issue analysis, i.e. one
based upon scanning current information sources.
There are two relevant research traditions for web data analysis, which will be described here
as purist and pragmatic. Either could potentially be suited to broad issue scanning. The purist
approach is to analyse an Internet phenomenon as it is found, seeking to describe it as accurately as
possible. The pragmatic approach is to analyse a phenomenon from the perspective of attempting to
gain information about an underlying phenomenon, rather than the Internet data (i.e. for indirect
research). The key difference between the two approaches is that the former typically does not use
any data cleansing whereas the latter tends to use extensive data cleansing. To illustrate the
difference, statistical physics researchers have taken samples of web page links in order to model the
web and web growth, without significant data cleansing (Barabási, 2002; Cothey, 2004; Huberman,
2001), whereas the information science tradition analysing samples of web page links in academic
sites may use various types of data cleansing (alternative counting methods, excluding duplicate
pages, excluding pages not authored by the web site owner) in order to better use links to infer
underlying scholarly communication patterns (Björneborn, 2001; Thelwall, 2004). Another Internet
example where data cleansing heuristics are necessary is web log file analysis, because of the need to
filter out the activities of web crawlers in order to understand human visiting or browsing patterns
(e.g., Spink, Wolfram, Jansen, & Saracevic, 2001; Wheeldon & Levene, 2003). Data cleansing is to
be avoided unless strictly necessary because it is a time consuming, labour-intensive process,
however (Kim, Choi, Hong, Kim, & Lee, 2003; Pyle, 1999). It must also be conducted carefully in
academic research because it tends to involve human judgement and/or heuristics (e.g., Hernandez &
Stolfo, 1998; Li, Zhang, & Zhang, 2003; Shahri & Barforush, 2004).
In this study, we seek to assess the suitability of RSS feeds as a data source for broad issue
scanning, using public concern about science policy as a case study. We take a purist approach,
avoiding significant data cleansing, despite the indirect nature of the task (as defined above). A purist
approach is useful for exploratory research (e.g., Stokes, 1997) in order to gain insights into the
phenomenon of RSS feeds, and also to check whether data cleansing is needed. The insights gained
can be used in future pragmatic research in order to design effective data cleansing strategies, if these
are proven necessary. The objective of this preliminary research is therefore primarily descriptive: to
describe RSS feeds from the perspective of identifying emerging topics relative to a broad issue. A
case study approach is taken, seeking to identify emerging debates relevant to the broad issue of
public science concerns.
Background and related research
Issue tracking
Within information science there are traditions for analysing the structure of information through the
analysis of documents, particularly for the scientific literature. In bibliometrics, for example,
scientific fields may be illustrated through diagrams, generated from citations between relevant
journal articles (Small, 1999), co-authorship patterns (White & Griffith, 1982) or the co-occurrence of
words in the titles of academic papers (Leydesdorff, 1989). An important trend in bibliometrics is a
concern with causative factors to explain the patterns discovered, often investigated through
qualitative analysis (Borgman & Furner, 2002; Cronin, 1984). Bibliometric results can fruitfully be
compared over different time periods in order to study the evolution of scientific fields or broader
aspects of science, such as patterns of international communication (Glänzel, 2001). In addition, there
are also a number of studies that analyse the growth of scientific literature, either globally or within
specific fields (Price, 1963).
Issue tracking is the task of monitoring a general issue over time. Early issue tracking showed
that the growth in an issue could be retrospectively analysed through keyword tracking in publication
databases. A case study of “acid rain” was used, and the tracking was conducted through academic
publication databases (Lancaster & Lee, 1985). The method was successful in terms of producing
interesting results, but revealed limitations, including the difficulty of identifying relevant papers
before the topic terminology became universally accepted. Similar methods can now be used on a
wider range of databases, so issue tracking is not restricted to academic topics and the innovative
selection of databases may reveal the diffusion of ideas between different sectors of society
(Wormell, 2000). A generic limitation of issue tracking based upon publication databases, however, is
its retrospective nature.
The web can be conceptualised as a large database, but its disadvantage for issue tracking is
that web pages can die out, so the web is not an accurate historical record. The web can be a good
source for a snapshot impression of an issue (Thelwall, Vann, & Fairclough, 2006), but to be used for
historical information, researchers must set up experiments to capture or monitor collections of web
pages (Bar-Ilan & Peritz, 2004; Koehler, 2004; Rousseau, 1999), must accept the limitations of search
engines’ memories (Leydesdorff & Curran, 2000; Hellsten, Leydesdorff, & Wouters, 2006; Wouters,
Hellsten, & Leydesdorff, 2004), or must use a large scale repository such as the Internet Archive
(archive.org) and its Wayback Machine time series search facility. As yet, however, there is no easily
accessible and reliable method of generating accurate time series data from the web without long term
data collection exercises.
Note that issue tracking is different to the computer science tasks of topic identification and
tracking, which are typically designed for the discovery of unknown topics and subsequently tracking
them in a corpus, such as one based upon newswire feeds (Clifton, Cooley, & Rennie, 2004). The
concept of a topic is much narrower than that of a broad issue. Topic identification has also been
applied to search engine logs (Ozmutlu & Cavdur, 2005), giving a problem with some common
characteristics. Similar techniques, time series analysis of word frequencies, are used in linguistics to
identify changes in language use (e.g., Meibauer, Guttropf, & Scherer, 2004).
RSS feeds: A technical overview
The RSS format is an XML initiative designed to exchange summary information in a compact
format. Its model is the syndication of news stories provided by companies like Reuters. Although
many versions of RSS are commonly in use, there are two major types that forked from version 0.9: a
simple type and a more sophisticated version (Hammersley, 2005; Hammond, Hannay, & Lund,
2004). The simple type, leading to the Atom format (www.atomenabled.org), was designed for
publishing general content, such as web site and blog updates. The sophisticated variant, RSS 1.0, is
extensible, using the Resource Description Framework (RDF) to allow standardised information to be
published as metadata, including, but not limited to, the Dublin Core initiative (dublincore.org). This
has potential semantic web applications (Karger & Quan, 2004). As an example application, the
Publisher Requirements for Industry Standard Metadata (PRISM, www.prismstandard.org) initiative
allows journals to use a common structured format to publish article metadata in RSS feeds (e.g.,
<prism:volume> for the volume number of an article). Feeds are typically automatically
produced by a special purpose program or can be built into other content management programs such
as blog software (e.g., blogger v5.0).
In essence, an RSS feed is a URL that returns an XML document in one of the accepted RSS
formats, possibly with informal additions, containing at its heart a list of ‘items’ carrying its main
content. Each item is typically a summary of a distinct piece of information, such as the following
two which have been taken (and simplified) from different feeds.
<item>
<title>Toxic Waste Fire Evacuates 23,000 in Ark.</title>
<link>http://www.allheadlinenews.com/cgi-bin/news/news.cgi?id=1104704187</link>
<description>AllHeadlineNews.com Sun, 2 Jan 2005 22:15:07
GMT</description>
</item>
<item>
<title>Weather Report</title>
<link>http://bunsen.tv/2004/09/weather-report.html</link>
<description>People In Florida should probably find a
place to live.</description>
<dc:creator>Bunsen !</dc:creator>
<dc:date>2004-09-11T17:34:06Z</dc:date>
</item>
The list of items in an RSS feed will be periodically updated. For instance, every hour new items may
be added and the oldest ones in the list removed. Hence, to check for new content, RSS monitoring
software can compare the current list of items with the previous list obtained from the same feed,
reporting only the newer ones.
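This differencing step can be sketched in Python using only the standard library. The helper names and the example feed documents below are illustrative assumptions, not part of the monitoring system described in this paper; items are identified here by their <link> element.

```python
import xml.etree.ElementTree as ET

def item_links(rss_xml: str) -> set[str]:
    """Collect the <link> of every <item> in an RSS 2.0-style document."""
    root = ET.fromstring(rss_xml)
    return {item.findtext("link") for item in root.iter("item")}

def new_items(previous_xml: str, current_xml: str) -> set[str]:
    """Report only links present in the current poll but absent from the previous one."""
    return item_links(current_xml) - item_links(previous_xml)

first_poll = ("<rss><channel>"
              "<item><link>http://example.org/a</link></item>"
              "</channel></rss>")
second_poll = ("<rss><channel>"
               "<item><link>http://example.org/a</link></item>"
               "<item><link>http://example.org/b</link></item>"
               "</channel></rss>")
print(new_items(first_poll, second_poll))  # {'http://example.org/b'}
```

A real monitor would also need to handle feeds that repost old items with edited text, which is why the choice of item identifier matters.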
A feature of XML is that documents can contain text that the parser is intended to ignore,
flagged by the keyword CDATA. In practice, HTML content can be placed inside RSS feeds via
the CDATA tag. This moves away from the original intention of RSS, which was to provide only
summary information, but is permitted, as shown in the example below. In principle the content could
be long, including a whole web page.
<content:encoded>
<![CDATA[<p><b>Simple </b>example</p>]]>
</content:encoded>
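Standard XML parsers return the character data of a CDATA section verbatim, so the embedded HTML is recovered as ordinary text. A minimal sketch in Python follows; the xmlns:content declaration is added here for the fragment to parse in isolation (real feeds declare it on the root element), and the namespace URI is that of the RSS 1.0 content module.

```python
import xml.etree.ElementTree as ET

# A simplified <item> carrying HTML inside a CDATA section
doc = """<item xmlns:content="http://purl.org/rss/1.0/modules/content/">
<content:encoded><![CDATA[<p><b>Simple </b>example</p>]]></content:encoded>
</item>"""

root = ET.fromstring(doc)
# Namespaced elements are addressed with Clark notation: {uri}localname
encoded = root.findtext("{http://purl.org/rss/1.0/modules/content/}encoded")
print(encoded)  # <p><b>Simple </b>example</p>
```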
To illustrate a use of the CDATA field, the blog alterslash.org has offered two RSS feeds: a normal
one with just metadata (http://www.alterslash.org/rss.xml) and an "extended" one that included
almost the full content of the site home page in CDATA sections embedded in a description field
(http://www.alterslash.org/rss_full.xml, e.g., 8 July 2005).
Quantitative RSS and blog analyses
For research purposes, some companies and academics have developed specialist RSS monitoring
systems to automatically track large numbers of RSS feeds. There is no authoritative single source for
RSS feed lists, but there are many web sites that host large databases designed to allow users to
search for relevant feeds. A large-scale RSS monitoring system must therefore use heuristics to
identify and select its feeds. For example, a system may monitor all RSS feeds, or restrict its list to
news feeds or blog feeds.
As part of the development of blog and RSS monitoring systems, topic popularity time series
have been produced and analysed, using word frequency counts (Glance, Hurst, & Tomokiyo, 2004;
Gruhl et al., 2004; Kumar, Novak, Raghavan, & Tomkins, 2003). Kumar et al. (2003) analysed the
link structure of blogs, crawling them directly. They observed that discussions often came in bursts of
activity, with information propagating significantly between blogs. The way in which information
diffuses in blogspace has attracted particular attention (Adar, Zhang, Adamic, & Lukose, 2004; Gruhl
et al., 2004), particularly because blogspace is a rare example of an environment that can be
monitored and can host reasonably self-contained discussions so that direct evidence of information
diffusion can be collected. The research of Gruhl et al. (2004) used 11,804 RSS blog feeds (a total of
401k items) plus 14 RSS news channels. Blog topics were characterised into three general types: a
reasonably steady volume of chatter; ‘spiky’ chatter with occasional externally-induced significant
increases; and topics that are rarely discussed except when influenced by external events (Gruhl et al.,
2004). Presumably there are also cases where there is a general trend for increasing or decreasing
topic popularity, either sudden or gradual. From this research it is clear that it is possible to identify
popular topics from large collections of feeds, and that these topics will have different dynamics. Of
particular concern is the concept of resonance: whilst most postings attract no visible attention, some
seem to strike a chord and create a recordable burst of discussion.
In addition to the private feed monitoring systems of researchers and companies, there are
also some web sites that offer selected statistics from large corpuses of RSS feeds or blogs. For
example www.daypop.com offered a list of the top 20 blog words (www.daypop.com/burst/, 11 July,
2005) and the top 20 news words in terms of “heightened usage” (www.daypop.com/newsburst/, 11
July, 2005), the top 40 links in blogs (www.daypop.com/top/, 11 July, 2005) and the top 100 blogs in
terms of citations (links from other blogs). Another interesting list is the MIT Media lab’s link-based
“most contagious information currently spreading in the weblog community” (www.blogdex.net, 11
July, 2005). Other web sites also give statistical information about blogs, typically in the form of top
100 lists (e.g., blogstreet.com). Whilst these web sites can give useful insights for researchers, they
are not an optimal choice for RSS or blog research because the information they give is restricted to
what the developers wish to make available and they typically do not publish the full details of their
methodologies. In particular, the origins of the corpus and information about low frequency words or
unpopular feeds would be difficult to get from statistical web sites, simply because it would probably
not be in their commercial interests to provide information that would be of little value to most
people.
Qualitative blog analyses
Although at the time of writing there did not seem to be any qualitative research into RSS feeds, there
is a huge body of qualitative blog research (e.g., Huffaker & Calvert, 2005; Matheson, 2004), which
explores issues such as politics, journalism and language use. The most relevant study is that of Kutz
and Herring (2005), who used repeated downloading of news web sites, once per minute. They
employed content analysis to analyse how individual news stories were updated. The monitoring of
sources on a minute-by-minute basis allowed them to identify interesting short-term changes such as
the addition of ideology to news stories after initial fact-driven versions had been posted.
Method
Data Collection: The RSS monitor and evaluator system
A new RSS feed monitoring and processing system, the Mozhdeh RSS monitor, was constructed to
gather the data used in this paper. It was based upon existing automatic methods (Gruhl et al., 2004)
but with additions and modifications for the new task. The raw data for this project was a collection
of 19,587 RSS feeds (almost double the number of the Gruhl paper) culled from a wide range of
sources including Google searches, RSS feed sites and major online news sources. A special effort
was made to identify as many science and technology-related feeds as possible, as well as personal blog feeds.
Preference was given to English language feeds but non-English feeds were not excluded. The feeds
are therefore an ad-hoc collection tailored to the task. Each feed was polled on an hourly basis,
recording the time of polling and all new feed items (i.e. items that were not present in the previous
feed from the same source). A basic algorithm was implemented to decrease the polling frequency of
infrequently updated feeds – an ethical consideration to avoid unnecessary usage of others’
computing resources.
A key difference with the Gruhl system is that the program can be primed with a Boolean
expression (e.g., “dog AND (cat OR kitten)”) and will then perform all subsequent analysis on only
the RSS items matching the set expression. This feature allows the identification of topics that are
relevant to a given broad issue, as characterised by the chosen Boolean expression.
A second important difference is that word frequencies alone are used to generate time
series for the identified postings (i.e. without natural language processing (NLP) techniques,
ontologies, links or thesauri). These and statistical information content measures (e.g.,
Sebastiani, 2002; Yang & Pedersen, 1997) are avoided in order to obtain intuitive, fast data
and to minimise the risk of missing new topics because they centre on new terms, or use
language in original ways that could mislead linguistic techniques. This is a specific concern
for science policy debates, in which participants have been known to play with language as
part of their rhetorical strategy (e.g., Hellsten, 2003).
Data was collected from November 2004, but analysed from February 5, 2005 to April 6, 2005,
after the corpus had reached full size. This produced a total of 5,776,263 separate RSS feed items.
Dynamic term-based indexes were implemented that catalogue for each feed item the list of words
contained, the owning feed and the posting date. For each word, the identifier of each containing feed
item was indexed, allowing the rapid generation of word-based time series. The index was used to
automatically select the apparent ‘science debate’ postings using the method above, producing a total
of 19,175 items.
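The term-based index can be sketched as an inverted index from words to (item, day) postings, from which a word's time series is generated by counting postings per day. The names and the day-level granularity here are simplifying assumptions.

```python
import re
from collections import defaultdict

# word -> list of (item_id, day) postings: a toy version of the dynamic term index
index = defaultdict(list)

def add_item(item_id: int, day: int, text: str) -> None:
    """Index each distinct word of a feed item under its posting day."""
    for word in set(re.findall(r"[a-z']+", text.lower())):
        index[word].append((item_id, day))

def time_series(word: str, n_days: int) -> list[int]:
    """Daily count of feed items containing the word."""
    counts = [0] * n_days
    for _, day in index[word]:
        counts[day] += 1
    return counts

add_item(1, 0, "Scientists fear new risk")
add_item(2, 1, "Stem cell research concern")
add_item(3, 1, "Public fear of stem cell risk")
print(time_series("fear", 3))  # [1, 1, 0]
```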
Analysis
A heuristic was used to identify postings (items) that were most likely to be related to the broad issue
of public science concern. Items were first identified as science-relevant if they contained one of the
words {science scientist scientists scientific research researcher researchers researching researched}.
They were additionally identified as relating to public science concerns if they also contained one of
the following set of concern words {argue argued fear afraid worry worried concern concerned
frightened scare scared risk risked risky}. The word lists were constructed by introspection and
scanning postings judged to be about science debates. The collection of items containing at least one
word in each of the two sets is labelled the science concern corpus. The words in this corpus are co-words in the sense of co-occurring with one 'science' word and one 'concern' word.
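The selection heuristic above reduces to a simple set-intersection test, using the two word lists given in the text (the function name and tokenisation are illustrative):

```python
import re

SCIENCE = {"science", "scientist", "scientists", "scientific", "research",
           "researcher", "researchers", "researching", "researched"}
CONCERN = {"argue", "argued", "fear", "afraid", "worry", "worried", "concern",
           "concerned", "frightened", "scare", "scared", "risk", "risked", "risky"}

def in_concern_corpus(item_text: str) -> bool:
    """An item joins the science concern corpus if it contains at least one
    word from each of the two lists."""
    words = set(re.findall(r"[a-z']+", item_text.lower()))
    return bool(words & SCIENCE) and bool(words & CONCERN)

print(in_concern_corpus("Scientists worried about stem cell policy"))  # True
print(in_concern_corpus("Scientists publish new results"))             # False
```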
A time series was generated for each word occurring in the science concern corpus: for each day
the proportion of science concern feed items containing the word was calculated (hereafter: relative
word frequency). A debate may be characterised by an increase in the quantity of discussion around a
topic and hence a logical method to identify individual science concern debates is to search for time
series that significantly increase in value. Although debates may increase rapidly or slowly build up,
as a practical consideration only rapidly increasing debates were sought. Four different methods for
identifying words indicative of debates were assessed, as described below, and motivated by Gruhl et
al. (2004). In all cases the first 80 days (whilst the RSS feed list was being built) were not tested for
debates but were used in the calculation of average word frequencies.
• Spike: the relative word frequency r(d) on a given day d was at least 5 times higher than the
average relative word frequency of all previous days, (1/(d−1)) Σ_{i=1..d−1} r(i).
• Short burst: the minimum r(d,3) = min{r(d), r(d+1), r(d+2)} of the relative word frequencies
on three consecutive days d, d+1, d+2 was at least 5 times higher than the average relative
word frequency of all previous days.
• Medium burst: the minimum r(d,5) of the relative word frequencies on five consecutive days
d, d+1, d+2, d+3, d+4 was at least 5 times higher than the average relative word frequency of
all previous days.
• Long burst: the minimum r(d,9) of the relative word frequencies on nine consecutive days d,
d+1, …, d+8 was at least 5 times higher than the average relative word frequency of all
previous days.
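Since the four detectors differ only in window length, they can be sketched as a single function (window 1 gives spikes; 3, 5 and 9 give short, medium and long bursts). Skipping words with a zero baseline is a simplifying assumption here: as noted in the text, words first used after the lead-in period have an effectively infinite relative increase and need separate handling.

```python
def bursts(r: list[float], window: int, factor: float = 5.0, lead_in: int = 80) -> list[int]:
    """Days d (0-indexed) where the minimum relative frequency over `window`
    consecutive days is at least `factor` times the mean of all previous days."""
    found = []
    for d in range(max(lead_in, 1), len(r) - window + 1):
        baseline = sum(r[:d]) / d  # average relative frequency of days 0..d-1
        if baseline > 0 and min(r[d:d + window]) >= factor * baseline:
            found.append(d)
    return found

# A quiet 10-day lead-in followed by a three-day surge
series = [0.01] * 10 + [0.2, 0.25, 0.3]
print(bursts(series, window=3, lead_in=10))  # [10]
```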
Four different methods were chosen because preliminary experiments found no single method to be
clearly successful, so it was necessary to evaluate a set of sensible choices. The threshold multiplier
of 5 was selected heuristically: other values were tried but did not give better results.
A list of burst/spike words was identified by each of the above co-word selection methods
and ordered by the size of the largest word frequency difference in the time series (i.e. r(d), r(d,3),
r(d,5), or r(d,9)). The choice of ordering by absolute rather than relative word frequency difference
was because many words had an effective word frequency relative increase of infinity because they
were first used after the 80 day lead in period of the corpus. Two types of analysis were performed.
1. For each of the four lists, the top 20 words were selected and investigated to find out why they
had a high frequency. The purpose of this was to assess the time series algorithm to see whether it
was working in the intended way. The checking was performed by identifying the RSS items
containing the word.
2. For the first choice method, short bursts, the top 1000 terms were checked to see whether they
directly referred to a science debate. Only nouns and unknown words were checked. The
checking was again performed by identifying the RSS items containing the word.
Results
Spikes (1 day)
The top results were dominated by a few blog threads mainly from the alterslash.org site. In
alterslash.org, contributors are allowed to post stories and others can then post comments on them,
starting a thread. Active threads result in the text of early posts being highly replicated and reposted.
Blogs like alterslash.org with multiple contributors also have the advantage of being more active.
Table 1 is notable for the inclusion of relatively few nouns or content-loaded terms; for example,
"ago" and "didn't" would give little clue as to the topic. Related to this point, it is clear that one
prolific thread could generate many high frequency words from the early posts. One single RSS feed
item for the shuttle story (mentioned in Table 1) is summarised below, with […] indicating sections
cut out of this very long item. Each later posting to this thread included all previous postings.
Posted by Zonk (29% noise) View
Somegeek writes “SpaceDaily.com is running a story that NASA never performed a formal
risk analysis of a shuttle mission to rescue the Hubble Space Telescope […]
Little to do with safety - by CaptDeuce (Score: 4, Interesting) Thread
… previous NASA administrator Sean O’Keefe made the decision “based on what he
perceived was the risk”. This perceived risk is in performing a manned shuttle mission that is
out of range of using the International Space Station as an emergency refuge. …
Loose consensus at sci.space.tech is that O’Keefe’s decision has virtually nothing to do with
safety and everything to do with the extremely tight schedule necessary to complete ISS
(International Space Station). […]
Table 1. Top spike words in the science concern corpus.

r(d)  | Word        | Description                                                                                                | Where
0.376 | noise       | Standard text in posts in prolific blogs (percentage of noise in post)                                     | alterslash.org
0.372 | posted      | Standard text in posts in prolific blogs ("posted by…")                                                    | alterslash.org
0.372 | thread      | Standard text in posts in prolific blogs                                                                   | alterslash.org
0.369 | score       | Standard text in posts in prolific blogs                                                                   | alterslash.org
0.367 | writes      | Standard text in posts in prolific blogs                                                                   | alterslash.org
0.364 | provide     | Coincidence: occurs in the original text of three separate topics spawning threads on the same day 24/4/05 | alterslash.org
0.356 | view        | Standard text in posts in prolific blogs ("click to view")                                                 | alterslash.org
0.337 | cowboyneal  | Name of blog poster: post was extensively replicated and commented                                         | alterslash.org
0.334 | months      | Blog thread about a new technical magazine; Linux life expectancy thread                                   | alterslash.org (mainly)
0.333 | decided     | Shuttle story blog thread ("…decided to cancel the shuttle…")                                              | alterslash.org
0.332 | informative | Standard text in posts in prolific blogs (post rating)                                                     | alterslash.org
0.328 | ago         | Shuttle story blog thread                                                                                  | alterslash.org
0.325 | instead     | Shuttle story blog thread                                                                                  | alterslash.org
0.325 | current     | Shuttle story blog thread                                                                                  | alterslash.org
0.317 | didn't      | Coincidence: occurs in several blog threads on the same day (e.g., smoking, AIDS)                          | e.g. deanesmay.com
0.315 | production  | Coincidence: two blog threads used this word on the same day                                               | alterslash.org
0.31  | zonk        | Name of blog poster: post was extensively replicated and commented                                         | alterslash.org
0.306 | paper       | Nanotechnology blog thread ("nanotechnology paper-like display")                                           | alterslash.org
0.3   | able        | Several blog threads                                                                                       | alterslash.org
Fig. 1. Science concern co-word time series for “ago” and “current”, sharing a common main spike
due to both occurring in the original post for a space shuttle story thread.
Short bursts (3 days)
The top results were dominated by standard text in blog threads from a few sites. Individual stories in
alterslash.org did not feature since these tended to last for a maximum of one or two days. In
livejournal.com, however, the feeds were very long and cumulative, rather than posting just the new
content. This reposting allowed the site to dominate the results for longer periods of time, despite
being less active than alterslash.org. Coincidence is also evident in Table 2 with some words being
present as a result of being used in multiple unrelated threads.
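The burst scores reported in the tables (e.g., r(d,3)) compare a word's recent frequency with its earlier baseline. The formula itself is not given in this section, so the sketch below uses an assumed definition for illustration only: the mean daily frequency in the n-day window ending on day d, divided by a smoothed mean over all earlier days.

```python
def burst_score(daily_counts, day, window=3):
    """Illustrative n-day burst score (an assumed definition, not
    necessarily the paper's r(d, n)): mean frequency in the window
    ending at `day`, divided by a smoothed mean of all earlier days."""
    start = max(0, day - window + 1)
    window_counts = daily_counts[start: day + 1]
    baseline = daily_counts[:start]
    base_mean = sum(baseline) / len(baseline) if baseline else 0.0
    win_mean = sum(window_counts) / len(window_counts)
    return win_mean / (base_mean + 1.0)  # +1 smooths zero baselines


# A word appearing once a day that jumps to nine mentions a day
# produces a clear 3-day burst relative to its baseline.
score = burst_score([1, 1, 1, 1, 9, 9, 9], day=6, window=3)
```

A step change in a word's daily frequency, as in the example series above, is exactly the pattern the ranked word lists are intended to surface.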
Table 2. Top three day burst words in the science concern corpus.
Word | r(d,3) | Description | Where
posted | 0.199 | Standard text in posts in prolific blogs: ("posted by…") | alterslash.org
march | 0.175 | Month and protest march | Many
noise | 0.175 | Standard text in posts in prolific blogs: (percentage of noise in post) | alterslash.org
informative | 0.162 | Standard text in posts in prolific blogs: ("informative" post rating) | alterslash.org
thread | 0.155 | Standard text in posts in prolific blogs | alterslash.org
score | 0.152 | Standard text in posts in prolific blogs | alterslash.org
funny | 0.145 | Standard text in posts in prolific blogs: ("funny" post rating) | alterslash.org
insightful | 0.139 | Standard text in posts in prolific blogs: ("insightful" post rating) | alterslash.org
couldn't | 0.132 | From burst of story posts | Many, e.g. www.livejournal.com/users/mpoetess/
april | 0.116 | Month | Many
they've | 0.114 | Blog thread (women leaving IT) | alterslash.org
sounds | 0.111 | Blog threads (three separate threads, using different meanings of the word "sound") | alterslash.org
february | 0.104 | Date | Many
successful | 0.098 | Blog threads (several separate ones) | alterslash.org
windows | 0.087 | Multiple Microsoft threads plus stories with glass windows | alterslash.org plus others
senate | 0.087 | Repost "threads" (political story) | twistedchick (livejournal.com) plus others
allowing | 0.086 | Blog threads (several separate ones) | alterslash.org
opposed | 0.085 | Blog threads | rozk and twistedchick (livejournal.com)
hostile | 0.079 | Blog threads | rozk and twistedchick (livejournal.com)
trouble | 0.075 | Blog threads | rozk and twistedchick (livejournal.com)
Fig. 2. A science concern co-word time series for “wikipedia”, with a burst at the end of February
caused by 3 separate consecutive stories.
Figure 2 is a time series for the word wikipedia (rank 25), which was discussed in two unrelated
threads ("fud-based encyclopaedias" followed by "interview with Lawrence Lessig") over three
consecutive days at the end of February. In any large data collection, some coincidences are to be
expected as statistical phenomena.
Medium bursts (5 days)
The top medium bursts were dominated by alterslash.org, and by twistedchick and rozk from
livejournal.com. Prolonged reposting in livejournal.com led to the continual reappearance of old
stories, as well as to very large RSS items.
Long bursts (9 days)
The top long bursts were dominated by twistedchick and rozk from livejournal.com. The one
exception in the top 20 was a prolonged thread “steps to a quieter PC” from alterslash.org with many
contributors offering opinions. Even for the longer bursts, nouns were not ubiquitous. For example,
the top 10 terms were: unhappy; william; deputy; hostile; remarks; america's; exam; remark;
reactionary; irresponsible.
Results scanning
Despite the dominance of useless words at the top of all the word lists, it was possible to manually
scan the lists and test any word that seemed a promising indicator of a genuine debate. Applying this
method to the top 1,000 terms in the 3-day burst list produced only two genuine topic-indicating
words: schiavo (rank 164) and ozone (rank 353). Figure 3 illustrates the schiavo topic through a time
series in the science concern data set (73 matches with few repeated thread posts) and Figure 4 gives
equivalent series for the full data set (6490 matches). The term “schiavo” refers to the case of Terri
Schiavo, which spawned a genuine public debate. The first science concern item illustrates the
political side of the case and why it became significant: reading this in conjunction with Figure 4
provides illuminating insights into this debate.
The Terri Schiavo case has transfixed the right wing media while attracting comparatively
little attention from the left. […] Michael Schiavo is Terri's legal guardian, a court has found
repeatedly that Terri wouldn't want a feeding tube, and Michael asked the doctors to take the
tube out. That's really all there is to it.
The Terri Schiavo appeal is a vicious and well-funded propaganda campaign. Terri's parents
and their allies are using pseudoscience and character assassination to destroy Michael
Schiavo. The right wing is eating it up.
If progressives don't counter these blatant misrepresentations now, the Terri Schiavo myths
will be used against us for years to come. (http://www.pandagon.net/mtarchives/004689.html)
Note that although this item did not use any of the listed science words, it was still selected as a
science story because of its science meta-tag <dc:subject>Science</dc:subject>
contained within the feed item.
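The selection mechanism described above can be reproduced with a few lines of standard XML processing. The item below is a minimal illustration (its title is invented); only the Dublin Core subject tag matters for the test.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag name for dc:subject in the Dublin Core namespace.
DC_SUBJECT = "{http://purl.org/dc/elements/1.1/}subject"

# A minimal RSS item carrying a science meta-tag (title invented).
item_xml = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>An example feed item</title>
  <dc:subject>Science</dc:subject>
</item>"""

def is_science_item(xml_text):
    """True if any dc:subject element of the item is 'Science'."""
    item = ET.fromstring(xml_text)
    return any(el.text == "Science" for el in item.findall(DC_SUBJECT))
```

An item selected this way can match the broad issue even when none of the listed science words occurs in its text, exactly as happened with the Schiavo posting.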
Figure 3. A science concern co-word time series for “schiavo” (3-day burst data).
Figure 4. A time series for the term “schiavo” in the full data set.
The comparison of Figures 3 and 4 is interesting for several reasons. First, only about a hundredth of
the postings were identified as science concern related. This story seems to be primarily medical and
political, rather than scientific, in the sense that the discussion concerns a routine decision rather than
one relating to new technologies. Second, the evolution of the story is better seen from the larger
data set. The time series is smoother as a result of the greater amount of data. Nevertheless, the
Schiavo case could not have been easily identified from the full data set because it would have been
surrounded by many non-science topics, so would have ranked much lower in a list of general topic
bursts. It should be noted that this is an ideal case for word frequency analysis because the term
schiavo apparently occurred only in connection with the debate.
Discussion
The results are dominated by words that are not good indicators of science concern debates. Two
genuine debates were identified by investigating the top 1000 words in the short burst list but it seems
likely that with data cleansing the results would be significantly better. The main cause of non-useful
terms was a small set of blogs with highly repetitive item generation policies, either through active
threads or through reposting of old stories. In consequence, some form of data cleansing now seems
unavoidable for broad issue scanning, despite the inevitable loss of some data and extra computing
power requirements.
The method used, employing Boolean expressions to identify postings that are potentially
relevant to a broad issue, necessarily shrinks the effective RSS corpus size, itself effecting a degree of data cleansing.
This is clear from a comparison of the results with a similar exercise identifying science-relevant
items (but not necessarily concerns). The top words for bursts contained more nouns and terms
indicative of content, although still mixed with more general terms (resort; untitled; snes; dose;
christine; apr; heck; creatures; reasonable; grand; mp3; mood; hunger; insomnia; lowered; intending;
handle; yard; resolve; serotonin). In fact there was a distinct medical and technological flavour to the
list of top terms. It seems that with a bigger effective corpus size, individual feeds are less likely to
dominate the results: data cleansing is most important for larger corpora.
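A Boolean pre-filter of the kind described above can be very simple. The word sets below are invented stand-ins for the study's actual science and concern term lists; the expression is a minimal conjunctive sketch, not the paper's exact query.

```python
SCIENCE_WORDS = {"science", "nanotechnology", "cloning"}  # illustrative only
CONCERN_WORDS = {"fear", "risk", "danger"}                # illustrative only

def matches_broad_issue(text):
    """True if the text contains at least one science word AND at least
    one concern word -- a minimal conjunctive Boolean expression."""
    words = set(text.lower().split())
    return bool(words & SCIENCE_WORDS) and bool(words & CONCERN_WORDS)
```

Only items passing such a filter enter the burst analysis, which is why the effective corpus is much smaller than the raw feed collection.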
There are several logical alternative options for data cleansing, some of which are described
below. Further experiments are needed to assess the effectiveness of each one.
 Excluding spam feeds. Some RSS feeds are Spam in the sense of being automatically
generated advertising (e.g. thousands of items starting with “have you considered buying…”)
and could be eliminated to improve the overall quality of the data set and to speed up the
process of data analysis. Spam elimination is common in Internet applications, including
search engines and email (Stitt, 2004), and hence it is a logical choice.
 Including only specified fields. It would be possible to process only feed fields thought to contain
high quality metadata, such as the title field and perhaps also a description field. This would
have the advantage of giving more concise data but the disadvantage that key words (e.g.
science, fear) could be omitted from those fields even when relevant to the post, reducing the
overall number of broad issue-relevant items identified, which is itself a problem.
 Limiting the number of words per feed item (e.g., the first or last 100 words). This would stop
the domination of very large feeds but, as above, would result in fewer broad issue-relevant
items being identified.
 Automatic identification of threads in feeds, and removal of previous postings in thread feeds.
This is a brute force approach and would be a non-trivial computer science exercise to
implement, since there will be many different formats of RSS feeds with similar problems.
Moreover, any algorithm would probably have to be updated as new feed formats and
threading applications emerged.
 Counting word frequencies by feed (per day) rather than by feed item. This is an attractive
option because it would stop any word having a frequency higher than 1 per day based upon
its reappearance in threads in different items of the same RSS feed. This technique would be
relatively easy to implement and could be applied to all blog feeds, avoiding the need for
continual software maintenance as new problematic feeds are identified. It is probably a
second-best option from a data quality perspective because, ideally, it would be desirable to
capture interest in a topic expressed in multiple postings within a thread, and this would make
that impossible. This is a variant of the alternative document models previously used
for link analysis (Björneborn, 2001; Henzinger, 2001; Thelwall, 2002).
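The last of these options can be sketched directly. The (feed, date, text) tuple format for feed items is assumed here for illustration; the point is that a word contributes at most once per feed per day, however many items in that feed repeat it.

```python
from collections import defaultdict

def per_feed_daily_counts(items):
    """Count words by feed per day rather than by feed item.
    `items` is an iterable of (feed_url, date, text) tuples
    (format assumed for illustration)."""
    feeds_using = defaultdict(set)  # (date, word) -> feeds that used it
    for feed, date, text in items:
        for word in set(text.lower().split()):
            feeds_using[(date, word)].add(feed)
    # A word's daily count is the number of distinct feeds using it.
    return {key: len(feeds) for key, feeds in feeds_using.items()}


# Two items from feed "a" repeating "shuttle" on the same day count once,
# so reposting threads within one feed cannot inflate the totals.
counts = per_feed_daily_counts([
    ("a", "2005-04-24", "shuttle shuttle"),
    ("a", "2005-04-24", "shuttle again"),
    ("b", "2005-04-24", "shuttle"),
])
```

This directly neutralises the repetitive thread and reposting behaviour that dominated the raw results, at the cost of losing within-feed intensity information.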
Limitations
As with any case study research, there are limits to the extent to which the findings can reliably be
generalised. The science concerns case study nevertheless revealed problems that seem likely to
occur for other broad issues, although the extent of the data cleansing problem will probably vary
with the exact choice of broad issue. In particular, the extent of discussion of the broad issue is
critical: for larger broad issues, it would be more difficult for individual feeds to dominate the
results. The following additional limitations are acknowledged.
 The RSS feed corpus itself was chosen in an ad-hoc manner and, given the influence of a few
individual blogs, a different corpus might not have had any anomalous blog feeds and could have
given better results.
 Only a limited range of options has been explored and many settings have been determined
heuristically. This is common practice in complex computing systems (Gruhl et al., 2004) and
unavoidable in a project with many possible variations but is still a limitation.
 The modelling assumption that a step change in discussion level will signal an emerging debate
rather than a gradual increase in debate has not been verified with real data.
 The methodology does not address the issue of recall: the proportion of real debates that were
identified. It is possible that debates were missed because they did not cause a step change or
occurred around words that were already relatively frequent (e.g., Microsoft). This is partly
unavoidable, since if there were a definitive list of public mini-debates on science then the system
would not be needed.
Conclusions
This paper sought to assess whether a purist approach to RSS feeds (i.e. using the raw feeds without
data cleansing) is suitable for broad issue scanning, using a co-word frequency time series approach.
The domination of the results by non-useful terms in the science concern case study showed that data
cleansing is necessary for efficient broad issue scanning. Raw RSS feeds are unsuitable because some
feeds carry extensive and repetitive content. This is a particular concern for small broad issues that do
not attract a large amount of discussion. Whilst commercial companies may be able to reduce
necessary data cleansing by maintaining a very large collection of RSS feeds, smaller-scale
applications do not have this option. The use of data cleansing techniques should allow future
researchers to identify emergent debates with less effort, by producing tables of bursty co-words with
a higher proportion of genuine topics. The Terri Schiavo case showed that useful information is
available in RSS feeds, once topics have been identified, and also that our broad issue scanning
method was only able to identify a fraction of postings on this topic: hence the full set of feeds should
be used to investigate topics, once identified. Nevertheless, it is unlikely that a perfect system can be
developed that would automatically identify emergent debates relevant to any given broad issue
because of coincidences and topic ambiguity. Hence the goal of creating lists of keywords potentially
indicating new debates for subsequent manual filtering seems realistic. Finally, it would be interesting
to assess the extent to which natural language processing, thesaurus and ontology techniques can be
employed to improve results, and whether this would be worth the performance degradation that they
would probably introduce.
Acknowledgements
The work was supported by a European Union grant for activity code NEST-2003-Path-1. It is part of
the CREEN project (Critical Events in Evolving Networks, contract 012684). We thank the reviewers
for their helpful comments.
References
Adar, E., Zhang, L., Adamic, L., & Lukose, R. (2004). Implicit structure and the dynamics of
blogspace. Workshop on the Weblogging Ecosystem at the 13th International World Wide
Web Conference, http://www.sims.berkeley.edu/~dmb/blogging.html.
Barabási, A. L. (2002). Linked: The new science of networks. Cambridge, Massachusetts: Perseus
Publishing.
Bar-Ilan, J. (1997). The 'mad cow disease', Usenet newsgroups and bibliometric laws. Scientometrics,
39(1), 29-55.
Bar-Ilan, J., & Peritz, B. C. (2004). Evolution, continuity, and disappearance of documents on a
specific topic on the Web: A longitudinal study of 'informetrics'. Journal of the American
Society for Information Science and Technology, 55(11), 980 - 990.
BBC. (2005). Microsoft makes web feeds easier. http://news.bbc.co.uk/1/hi/technology/4621223.stm.
Björneborn, L. (2001). Necessary data filtering and editing in webometric link structure analysis:
Royal School of Library and Information Science.
Blood, R. (2004). How blogging software reshapes the online community. Communications of the
ACM, 47(12), 53-55.
Borgman, C. L., & Furner, J. (2002). Scholarly communication and bibliometrics. Annual Review of
Information Science and Technology, 36, 3-72.
Burnett, R., & Marshall, P. (2002). Web theory: An introduction. London: Routledge.
Cairns, G., Wright, G., Bradfield, R., van der Heijden, K., & Burt, G. (2004). Exploring e-government
futures through the application of scenario planning. Technological Forecasting and Social
Change, 71(3), 217-238.
Caldas, A. (2003). Are newsgroups extending 'invisible colleges' into the digital infrastructure of
science? Economics of Innovation and New Technology, 12(1), 43-60.
Clifton, C., Cooley, R., & Rennie, J. (2004). TopCat: Data mining for topic identification in a text
corpus. IEEE Transactions On Knowledge And Data Engineering, 16(8), 949-964.
Corbett, J. B., & Durfee, J. L. (2004). Testing public (un)certainty of science. Science
Communication, 26(2), 129-151.
Cothey, V. (2004). Web-crawling reliability. Journal of the American Society for Information Science
and Technology, 55(14), 1228-1238.
Cronin, B. (1984). The citation process: The role and significance of citations in scientific
communication. London: Taylor Graham.
Gill, K. E. (2004). How can we measure the influence of the blogosphere? Paper presented at the
WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and
Dynamics.
Glance, N. S., Hurst, M., & Tomokiyo, T. (2004). BlogPulse: Automated trend discovery for weblogs.
Paper presented at the WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation,
Analysis and Dynamics.
Glänzel, W. (2001). National characteristics in international scientific co-authorship relations.
Scientometrics, 51(1), 69-115.
Gruhl, D., Guha, R., Liben-Nowell, D., & Tomkins, A. (2004). Information diffusion through
Blogspace. Paper presented at the WWW2004, New York,
http://www.www2004.org/proceedings/docs/1p491.pdf.
Hagendijk, R. (2004). Framing GM food: Public participation and liberal democracy. EASST Review,
23(1), 3-7.
Hammersley, B. (2005). Developing feeds with RSS and Atom. Sebastopol, CA: O'Reilly.
Hammond, T., Hannay, T., & Lund, B. (2004). The role of RSS in science publishing: Syndication
and annotation on the web. D-Lib Magazine, 12,
http://www.dlib.org/dlib/december04/hammond/12hammond.html.
Hellsten, I. (2003). Focus on metaphors: The case of "Frankenfood" on the web. Journal of Computer-
Mediated Communication, 8(4), http://www.ascusc.org/jcmc/vol8/issue4/hellsten.html.
Hellsten, I., & Leydesdorff, L. (2004). Measuring the meaning of words in contexts: An automated
analysis of controversies about 'Monarch butterflies,' 'Frankenfoods,' and 'stem cells.' Paper
presented at the Sixth International Conference on Social Science Methodology (RC33), Amsterdam, 17-20 August, http://users.fmg.uva.nl/lleydesdorff/meaning/measuring%20meaning.pdf.
Hellsten, I., Leydesdorff, L., & Wouters, P. (2006, to appear). Multiple presents: How search engines
re-write the past. New Media & Society. Retrieved 12 September 2005 from:
http://users.fmg.uva.nl/lleydesdorff/searcheng/
Henzinger, M. R. (2001). Hyperlink analysis for the Web. IEEE Internet Computing, 5(1), 45-50.
Hernandez, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the
merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 9-37.
Hine, C. (2000). Virtual Ethnography. London: Sage.
Huberman, B. A. (2001). The Laws of the Web: Patterns in the Ecology of Information. Cambridge,
MA: The MIT Press.
Huffaker, D. A., & Calvert, S. L. (2005). Gender, identity, and language use in teenage blogs. Journal
of Computer-Mediated Communication, 10(2),
http://jcmc.indiana.edu/vol10/issue12/huffaker.html.
Karger, D. R., & Quan, D. (2004). What would it mean to blog on the Semantic Web? Lecture Notes
in Computer Science, 3298, 214-228.
Kim, W., Choi, B. J., Hong, E. K., Kim, S. K., & Lee, D. (2003). A taxonomy of dirty data. Data
Mining and Knowledge Discovery, 7(1), 81-99.
Klintman, M. (2002). The genetically modified (GM) food labelling controversy: Ideological and
epistemic crossovers. Social Studies of Science, 32(1), 71-91.
Koehler, W. (2004). A longitudinal study of Web pages continued: a report after six years.
Information Research, 9(2), 174.
Koku, E., Nazer, N., & Wellman, B. (2001). Netting scholars: Online and offline. American
Behavioral Scientist, 44(10), 1752-1774.
Kumar, R., Novak, J., Raghavan, P., & Tomkins, A. (2003). On the bursty evolution of blogspace.
Paper presented at the WWW2003, Budapest, Hungary,
http://www2003.org/cdrom/papers/refereed/p477/p477-kumar/p477-kumar.htm.
Kutz, D., & Herring, S. C. (2005). Micro-longitudinal analysis of Web news updates. Proceedings of
the Thirty-Eighth Hawai'i International Conference on System Sciences (HICSS-38),
http://ella.slis.indiana.edu/~herring/news.pdf.
Lancaster, F. W., & Lee, J. L. (1985). Bibliometric techniques applied to issues management: A case study. Journal of the American Society for Information Science, 36(6), 389-397.
Leydesdorff, L. (1989). Words and co-words as indicators of intellectual organization. Research
Policy, 18, 209-223.
Leydesdorff, L., & Curran, M. (2000). Mapping university-industry-government relations on the
Internet: the construction of indicators for a knowledge-based economy. Cybermetrics, 4,
http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html.
Leydesdorff, L., & Etzkowitz, H. (2003). Can “The Public” be considered as a fourth helix in
University-Industry-Government relations? Report of the fourth triple helix conference.
Science and Public Policy, 30(1), 55-61.
Leydesdorff, L., & Hellsten, I. (2005, to appear). Metaphors and diaphors in science communication:
Mapping the case of ‘stem-cell research’. Science Communication,
http://www.leydesdorff.net/stemcells.pdf.
Li, Y. F., Zhang, C. Q., & Zhang, S. C. (2003). Cooperative strategy for Web data mining and
cleaning. Applied Artificial Intelligence, 17(5-6), 443-460.
Matheson, D. (2004). Weblogs and the epistemology of the news: Some trends in online journalism.
New Media & Society, 6(4), 443-468.
Meibauer, J., Guttropf, A., & Scherer, C. (2004). Dynamic aspects of German -er-nominals: A probe
into the interrelation of language change and language acquisition. Linguistics, 42(1), 155-193.
Nardi, B. A., Schiano, D. J., Gumbrecht, M., & Swartz, L. (2004). Why we blog. Communications of
the ACM, 47(12), 41-46.
Notess, G. R. (2002). RSS, aggregators, and reading the blog fantastic. Online, 26(6), 52-54.
Ozmutlu, S., & Cavdur, F. (2005). Neural network applications for automatic new topic
identification. Online Information Review, 29(1), 34-53.
Pikas, C. K. (2005). Blog searching for competitive intelligence, brand image, and reputation
management. Online, 29(4), 16-21.
Price, D. J. de Solla (1963). Little science, big science. New York: Columbia University Press.
Pyle, D. (1999). Data preparation for data mining. San Francisco, CA: Morgan Kaufmann.
Rousseau, R. (1999). Daily time series of common single word searches in AltaVista and
NorthernLight. Cybermetrics, 2/3,
http://www.cindoc.csic.es/cybermetrics/articles/v2i1p2.html.
Schaap, F. (2004). Multimodal interactions and singular selves: Dutch weblogs and home pages in
the context of everyday life. Paper presented at the AoIR 5.0, Brighton, UK.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys,
34(1), 1-47.
Shahri, H. H., & Barforush, A. A. (2004). A flexible fuzzy expert system for fuzzy duplicate
elimination in data cleaning. Lecture Notes in Computer Science, 3180, 161-170.
Small, H. (1999). Visualising science through citation mapping. Journal of the American Society for
Information Science, 50(9), 799-813.
Smith, S. (2005). Tapping the feed: In search of an RSS money trail. Econtent, 28(3), 30-34.
Spink, A., Wolfram, D., Jansen, B. J., & Saracevic, T. (2001). Searching the web: The public and
their queries. Journal of the American Society for Information Science, 53(2), 226-234.
Stitt, R. (2004). Curbing the Spam problem. IEEE Computer, 37(12), 8.
Stokes, D. E. (1997). Pasteur's quadrant: Basic science and technological innovation. Washington,
D.C.: Brookings Institution.
Sunstein, C. R. (2004). Democracy and filtering. Communications of the ACM, 47(12), 57-59.
Thelwall, M. (2002). Conceptualizing documentation on the Web: An evaluation of different
heuristic-based models for counting links between university web sites. Journal of the
American Society for Information Science and Technology, 53(12), 995-1005.
Thelwall, M. (2004). Link analysis: An information science approach. San Diego: Academic Press.
Thelwall, M., Vann, K., & Fairclough, R. (2006, to appear). Web issue analysis: An Integrated Water
Resource Management case study. Journal of the American Society for Information Science
and Technology.
Vander Beken, T. (2004). Risky business: A risk-based methodology to measure organized crime.
Crime, Law and Social Change, 41(5), 471-516.
Wei, C. P., & Lee, Y. H. (2004). Event detection from online news documents for supporting
environmental scanning. Decision Support Systems, 36(4), 385-401.
Wheeldon, R., & Levene, M. (2003). The best trail algorithm for assisted navigation of Web sites.
Paper presented at the 1st Latin American Web Congress (LA-WEB 2003), Santiago, Chile.
White, H. D., & Griffith, B. C. (1982). Author co-citation: A literature measure of intellectual
structure. Journal of the American Society for Information Science, 32(3), 163-172.
Wormell, I. (2000). Critical aspects of the Danish Welfare State - as revealed by issue tracking.
Scientometrics, 48(2), 237-250.
Wouters, P. (2000). Cyberscience: The informational turn in science. Lecture presented at the Free
University, Amsterdam.
Wouters, P. Hellsten, I., & Leydesdorff, L. (2004). Internet time and the reliability of search engines.
First Monday, 9(10). Retrieved 12 September 2005 from:
http://firstmonday.org/issues/issue9_10/wouters/index.html
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization.
In Proceedings of the 14th international conference on machine learning (ICML 1997) (pp.
412-420). Nashville, TN.