A comparison of methods for collecting web citation data
for academic organisations
Mike Thelwall
Statistical Cybermetrics Research Group, School of Technology, University of Wolverhampton,
Wulfruna Street, Wolverhampton WV1 1SB, UK.
E-mail: m.thelwall@wlv.ac.uk
Tel: +44 1902 321470 Fax: +44 1902 321478
Pardeep Sud
Statistical Cybermetrics Research Group, School of Technology, University of Wolverhampton,
Wulfruna Street, Wolverhampton WV1 1SB, UK.
E-mail: p.sud@wlv.ac.uk
Tel: +44 1902 328549 Fax: +44 1902 321478
The primary webometric method for estimating the online impact of an organisation is to count
links to its web site. Link counts have been available from commercial search engines for over a
decade but this was set to end by early 2012 and so a replacement is needed. This article
compares link counts to two alternative methods: URL citations and organisation title mentions.
New variations of these methods are also introduced. The three methods are compared against
each other using Yahoo. Two of the three methods (URL citations and organisation title
mentions) are also compared against each other using Bing. Evidence from a case study of 131
UK universities and 49 US Library and Information Science (LIS) departments suggests that
Bing's Hit Count Estimates (HCEs) for popular title searches are not useful for webometric
research but that Yahoo's HCEs for all three types of search and Bing’s URL citation HCEs
seem to be consistent. For exact URL counts, the results of all three methods in Yahoo and both
methods in Bing are also consistent. Four types of accuracy factors are also introduced and
defined: search engine coverage, search engine retrieval variation, search engine retrieval
anomalies, and query polysemy.
Introduction
One of the main webometric techniques is impact analysis: using web-based methods to estimate the
online impact of documents (Kousha & Thelwall, 2007b; Kousha, Thelwall, & Rezaie, 2010;
Vaughan & Shaw, 2005), journals (Smith, 1999; Vaughan & Shaw, 2003), digital repositories
(Zuccala, Thelwall, Oppenheim, & Dhiensa, 2007), researchers (Barjak, Li, & Thelwall, 2007; Cronin
& Shaw, 2002), research groups (Barjak & Thelwall, 2008), departments (Chen, Newman, Newman,
& Rada, 1998; Li, Thelwall, Wilkinson, & Musgrove, 2005b), universities (Aguillo, 2009; Aguillo,
Granadino, Ortega, & Prieto, 2006; Qiu, Chen, & Wang, 2004; Smith, 1999) and even countries
(Ingwersen, 1998; Thelwall & Zuccala, 2008). The original and most widespread approach is to count
hyperlinks to the object studied, normally using an advanced web search engine query. This search
facility was apparently due to cease by 2012, however, since Yahoo!, the only major search engine to
offer a link search service, was taken over by Microsoft (BBC, 2009), which had previously closed
down the link search capability of its own search engine Bing (Seidman, 2007). More specifically,
“Yahoo! has transitioned to Microsoft-powered results in the U.S. and Canada; additional markets
will follow throughout 2011. All global customers and partners are expected to be transitioned by
early 2012” (Yahoo, 2011a). This transition apparently includes phasing out link search because the
linkdomain command stopped working before April 2011 in the US and Canadian version of Yahoo.
For instance, the query linkdomain:wlv.ac.uk -site:wlv.ac.uk returned correct results
in the UK version of Yahoo (April 13, 2011) but only webometrics pages containing the query term
“linkdomain:wlv.ac.uk” in the US/Canada version (April 13, 2011). Moreover, Yahoo
stopped supporting automatic searches, as normally used in webometrics, in April 2011 (Yahoo,
2011b), leaving no remaining automatic source of link data from search engines. Hence it is important
to develop and assess online impact estimation methods as replacements for link searches.
Several alternative online impact assessment methods have already been developed and
deployed in various contexts. Link counts have been estimated using web crawlers rather than search
engines (Cothey, 2004; Thelwall & Harries, 2004). A limitation of crawlers is that they can only be
used on a modest scale because of the time and computer resources needed. In particular, this
approach cannot be used to estimate online impact from the whole web, which is the objective of most
studies. Other projects have used two different types of search to estimate online impact: web
mentions and URL citations. A web mention (Cronin, Snyder, Rosenbaum, Martinson, & Callahan,
1998) is a mention in a web page of the object being investigated, such as a person (Cronin et al.,
1998) or journal article (Vaughan & Shaw, 2003). Web mention counts are estimated by submitting
appropriate queries to a search engine. For instance the query might be a person’s name as a phrase
search (e.g., “Eugene Garfield”). The drawback of web mention searches is that they are often not
unique and therefore give some spurious matches (e.g., to Eugene Garfields other than the famous
bibliometrician in the above case). Another drawback is that there may be multiple equivalent
descriptions (e.g., “Gene Garfield”), which complicates the analysis. Finally, a URL citation is the
mention of the URL of a web page or web site in another web page, whether accompanied by a
hyperlink or not. URL citation counts can be estimated by submitting URLs as phrase searches to
search engines. The principal disadvantage of URL citation counts is conceptual: including URLs in
the visible text of web pages seems to be unnatural and so it is not clear that they are a reasonable
source of online impact evidence, except perhaps in special cases like articles (see below).
This article uses (web) organisation title mentions as a metric for organisations. This is really
just new terminology for an existing measurement (web mentions) that has not been used in the
context of organisations for impact measurements before, but has been used as part of a "word co-occurrences" indicator of the similarity of two organisations (Vaughan & You, 2010), as described
below. This study uses a set of UK universities and a set of US LIS departments to compare link
counts, organisation title mention counts and URL citation counts against each other to assess the
extent to which they have comparable outputs. It compares the results between Bing and Yahoo to
assess the consistency of these search engines. It also introduces some methodological innovations.
Finally, all the methods are incorporated into the free Webometric Analyst software (lexiurl.wlv.ac.uk, formerly known as LexiURL Searcher) to make them easily available to the webometric community.
Organisation title mentions as a Web impact measurement
The concept of impact and methods for measuring it are central to this article, but both are subject to
disagreement and opposition (see e.g., MacRoberts & MacRoberts, 1996; Seglen, 1998). The term
impact seems to be typically used in bibliometrics either as a general term and almost synonymous
with importance or influence, or as a technical term and synonymous with citation counts, as in the
phrases citation impact or normalised citation impact (Moed, 2005, p.37, p. 221). Moed recommends
using citation impact as a way to resolve the issue because it suggests the methodology used to
evaluate impact (Moed, 2005, p. 221). In this way it could be read as a general or technical
description. There seems to be a belief amongst bibliometricians that citation counts, if appropriately
normalised, can aid or replace peer judgements of quality for researchers (Bornmann & Daniel, 2006;
Gingras & Wallace, 2010; Meho & Sonnenwald, 2000), journals (Garfield, 2005), and departments
(Oppenheim, 1995, 1997; Smith & Eysenck, 2002; van Raan, 2000), but that they do not directly
measure quality because some high quality work attracts few citations and some poor work attracts
many. Instead citations are indicators, essentially meaning that they are sometimes wrong. Moreover,
in some contexts indicators may be of limited practical use. For example, journal citations may not
help much when evaluating individual arts and humanities researchers in book-based disciplines.
The idea behind citation analysis is that science is a cumulative enterprise and that in order to
contribute to the overall advancement of science, research has to be used by others to help with new
discoveries, and the prior work is acknowledged by citations (Merton, 1973). But research can also
have an impact on society, for example in the form of commercial exploitation, informing government
policy or informing/entertaining the public. Hence it could be argued that instead of counting just the
citations to published articles, it might also be helpful to count the number of times a researcher or
institution is mentioned – for example in the press or on the web (e.g., Cronin, 2001; Cronin & Shaw,
2002; Cronin et al., 1998). More generally, the “range of genres of invocation made possible by the
Web should help give substance to modes of influence which have historically been backgrounded”
(Cronin et al., 1998). As with motivations for citing (Bornmann & Daniel, 2008), there may be
negative and irrelevant mentions (for example, on the web) but this does not prevent counts of
mentions from being useful or correlating with other measures of value.
Link analysis for universities, although motivated by citation analysis, is not similar in the
sense that hyperlinks to universities rarely directly target their research documents (e.g. published
papers) but typically have more general targets, such as the university itself, via a link to a home page
(Thelwall, 2002b). Nevertheless links to universities correlate strongly with their research
productivity (e.g., Spearman’s rho 0.925 for UK universities: Thelwall, 2002a). Hence links to
universities can be used as indicators for research but what they measure is probably a range of things,
such as the extent of their web publishing, their size, their fame, the fame of their researchers,
professional activities and their contribution to education (see e.g., Bar-Ilan, 2005; Wilkinson,
Harries, Thelwall, & Price, 2003). Hence it seems reasonable to assess alternatives to inlinks: other
quantities that are measurable and may reflect a wide range of factors related to the wider impact or
fame of academic organisations. Two logical alternatives to hyperlinks are URL citations and title
mentions. A URL citation is similar to a link except that the URL is visible in a web page rather than
existing as a clickable link. It seems that moving from link counting to URL citation counting is a
small conceptual step. An organisation title mention (or invocation or web citation) is the inclusion of
the name of an organisation in a web page without necessarily linking to the organisation’s web site or
including its URL in the page. From a theoretical perspective, this is possibly a weaker indicator of
endorsement because links and URL citations contain navigation information, but titles do not.
Nevertheless, a title mention seems to be otherwise similar to URL citations and links in the sense that
all are invocations of the target organisation. As with mentions of individual researchers, organisation
title mentions could capture types of influence that would not be reflected by traditional citation
counts.
There is a practical problem with organisation title mentions: whilst URLs are unambiguous
and URL citations are almost unambiguous, titles are not (Vaughan & You, 2010). Therefore title-based web searches may in some cases generate spurious matches. For example, the phrase "Oxford
University” also matches “Oxford University Press” but www.ox.ac.uk is unique to the university,
even if this domain may host some external web sites.
Literature review
This section discusses evidence for the strengths and weaknesses of the three methods discussed
above. Most of the studies reviewed analyse a particular type of web site or web page and so their
findings are typically limited in scope from the perspective of the current article.
Link counts
Counts of links to web sites formed the original type of online impact evidence (Ingwersen, 1998). A
variant of link counting, Google's PageRank (Brin & Page, 1998), is a high-profile endorsement of the idea that links indicate impact, but a number of articles have also explicitly tested this. In particular, counts
of inlinks (i.e., hyperlinks originating in other web sites, sometimes called site inlinks (Björneborn &
Ingwersen, 2004)) to university web sites correlate with research productivity for the UK (Thelwall &
Harries, 2004), Australia (Thelwall, 2004) and New Zealand (Thelwall, 2004), and the same is true
for some disciplines in the UK (Li, Thelwall, Wilkinson, & Musgrove, 2005a). More directly, counts
of links to journal web sites correlate with journal Impact Factors (An & Qiu, 2004; Vaughan &
Hysen, 2002) for homogenous journal sets but numbers of links pointing to a research web site are not
a good indicator of researcher productivity (Barjak et al., 2007). In summary, there is good evidence
that in academic contexts inlink counts are reasonable indicators of academic impact at larger units of
aggregation than that of the individual researcher. However, in the case of university web sites the
correlation between inlinks and research productivity is because more productive researchers produce
more web content rather than because they produce better web content (Thelwall & Harries, 2004).
Outside of academic contexts, the concept of “impact” is much less clear but links to
commercial web sites have been shown to correlate with business performance measures (Vaughan,
2005; Vaughan & Wu, 2004) indicating that link counts are at least related to properties of the web
site owner.
Web mention counts
Web mentions, i.e., the invocation of the name of a person or object, were introduced to identify web
pages invoking academics as a way of seeking wider evidence of the impact, value or fame of
individual academics (Cronin et al., 1998). Many of the web pages found in this study did indeed give
evidence of the wider academic activities of the scholars (e.g., conference attendance). A web
mention is a textual mention in a web page, typically of a document title or person’s name.
Nevertheless a web mention encompasses any non-URL textual description. Web mentions can be
found by normal search engine queries. Typically a phrase search may be used but additional terms
may be added to reduce spurious matches.
Web mentions were first extensively tested for journal articles (Vaughan & Shaw, 2003).
Using searches for article titles and (if necessary to avoid false matches) subtitles and author names,
web mentions (called web citations in the article) correlate with Social Sciences Citation Index
citations, partially validating web mentions as an academic impact indicator (see also: Vaughan &
Shaw, 2005).
Web mentions have also been applied to identify online citations of journal articles in more
specialised contexts: online presentations (Thelwall & Kousha, 2008), online course syllabuses
(Kousha & Thelwall, 2008) and (Google) books (Kousha & Thelwall, 2009). In each case a
significant correlation was found between Web of Science citation counts and web mentions for
individual articles. Hence there is good evidence that web mentions work well as impact indicators for
academic articles.
Web mentions have also been assessed as a similarity metric for organisations. Vaughan and
You (2010) compiled a list of 50 top WiMax or Long Term Evolution telecommunications companies
to identify patterns of similarity by finding how often pairs of companies were mentioned on the same
page. Company mentions were assessed through the company acronym, name or a short version of its
name. Five companies had to be excluded due to ambiguous names (e.g., names also used by other,
larger organisations). For each pair of companies, Google and Google Blogs searches were used to
count how many pages mentioned both and the results were used to produce a multi-dimensional
scaling diagram of the entire set. The results were compared with diagrams created by Yahoo co-inlink searches for the same set of companies. The patterns produced by all methods were consistent
and matched the known industry sectors of the companies, giving evidence that counting co-mentions
of company names was a valid alternative to counting co-inlinks for similarity analyses. Moreover,
there was a suggestion that Google Blogs could be more useful than the main Google search, perhaps
due to blogs containing less shallow material (Vaughan & You 2010). This study used a similar
approach to the current paper, in the sense of comparing the results of title-based and link-based data,
except that it counted co-mentions instead of direct mentions, only used one query for each
organisation, did not directly compare the results for individual organisations (e.g., via a rank order
test), did not use URL Citations, and used Yahoo, Google and Google blogs instead of Bing and
Yahoo for the searches.
URL citation counts
URL citations are identified with search engine queries using a full or partial URL as a phrase search
(e.g., the query “www.wlv.ac.uk”). These have only been evaluated for collections of journal articles,
also with positive results (Kousha & Thelwall, 2006, 2007a). Hence, in this limited context, URL
citations seem to be a reasonable indicator of academic impact.
URL citations have also previously been used in a business context to identify connections
between pairs of organisations. To achieve this, Google was used to identify, for each pair of
organisations (A, B), the number of web pages in A that contained a URL citation to B. The results
were used to construct network diagrams of the relationships between organisations (Stuart &
Thelwall, 2006).
Comparisons of citation count methods
Since there is evidence that all three measures may indicate online impact it would be reasonable to
compare them against each other. One study used Yahoo to compare URL citations with inlinks for 15
web site data sets from previous webometric projects. The results showed a significant positive
correlation between inlinks and URL citations in all cases, with the Spearman coefficient varying
from 0.436 (URLs of sites linking to an online magazine) to 0.927 (European universities). It was
found that URL citations were typically less numerous than inlinks outside of academic contexts and
were much less numerous when path information (i.e., sections of the URL after the domain name,
such as www.bt.com/contact/ rather than www.bt.com) was included in the URLs (Thelwall, 2011, in
press). Hence, URL citations seem to be particularly numerous for universities (which are academic
and have their own domain name) but may be problematic for sets of smaller, non-academic web sites
if some are likely not to have their own domain name. This is consistent with prior research indicating
that URL citations were relatively rare in commercial contexts (Stuart & Thelwall, 2006).
As the above review indicates, there is a lack of research evaluating web mentions and URL
citations for data other than journal article collections and there is a lack of comparative research on
other types of data, except for inlinks compared to URL citations.
Search engines in webometric research
Commercial search engines are a common data source for webometric research, although some
studies use personal web crawlers instead. In consequence, the accuracy and consistency of search
engine results has also formed a part of webometrics. Two different types of search engine output can
be used: complete lists of results, and hit count estimates (HCEs). Complete lists of results can be
used if the total number of results is less than the maximum normally returned by search engines,
which is 1,000. The total number of matching pages obtained can be extended by using the query
splitting technique (Thelwall, 2008a), but this may not work well if there are more than
100,000 results, and it is slow. HCEs are the alternative choice if there are too many queries to get
exact page counts for or if there are too many results per query. HCEs are the estimates at the top of
search results pages about the total number of matching pages. It takes only one search per query to
obtain a HCE but a full list of 1,000 pages takes 20 searches because the matching pages are delivered
in sets of 50. Search engine application programming interfaces (APIs), as used by webometrics
software to automatically submit queries, are rate-limited to 5,000 (Yahoo before it closed in April
2011) or 10,000 (Bing) queries per day, and for large webometric exercises the up to 19 additional queries per search can make retrieving full sets of results impractical, and so HCEs are
used instead. This is particularly relevant when queries are used to get data for network diagrams
because one search is needed per pair of web sites, so the number of queries needed scales with the
square of the number of sites involved.
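To make the query budget concrete, the following back-of-envelope sketch (ours; the limits and pagination are those quoted above) shows why HCEs are often the only practical option:

    # Query budget arithmetic, assuming the API limits quoted above
    # (5,000/day for the former Yahoo API, 10,000/day for Bing) and
    # results delivered in sets of 50 up to a maximum of 1,000.

    def requests_per_full_list(max_results=1000, per_request=50):
        """Requests needed to page through one full result list."""
        return max_results // per_request  # 20

    def pair_queries(n_sites):
        """Co-link/co-mention queries needed for a network diagram."""
        return n_sites * (n_sites - 1) // 2

    print(requests_per_full_list())        # 20 requests per query
    print(131 * requests_per_full_list())  # 2,620 requests for 131 full lists
    print(pair_queries(131))               # 8,515 queries for a 131-site network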
In terms of the accuracy of search engine results: it has long been known that search engines
do not give comprehensive results since they do not index the whole web (Lawrence & Giles, 1999)
and results may fluctuate unpredictably over time (Bar-Ilan, 1999, 2001; Lewandowski, Wahlig, &
Meyer-Bautor, 2006; Rousseau, 1999). Moreover, they do not always return complete lists of matches
from their own index (Bar-Ilan & Peritz, 2008; Mettrop & Nieuwenhuysen, 2001) due, amongst other
factors, to spam removal and restrictions on the number of matches returned from the same web site
(Gomes & Smith, 2003). Moreover, the HCEs can vary between results pages for the same query,
which seems to be due to filtering unwanted pages from the results in stages rather than at the time of
submission of the original query (Thelwall, 2008b). Result stability may also vary in unpredictable
ways in different types of query (Uyar, 2009a, 2009b). In summary, search engine results should be
treated with caution, the HCEs and total number of results returned should normally be
underestimates of the total number of matching pages on the web, and there is no reason to believe
that different search engines would give similar results for the same queries.
Research questions
The objectives of this study are to compare all of the main web citation methods for assessing the
impact of academic organisations.
• Are URL citation counts, organisation title mention counts and inlink counts approximately equivalent in Bing (excluding inlink counts) and in Yahoo?
• Are URL citation HCEs, organisation title mention HCEs and inlink HCEs approximately equivalent in Bing (excluding inlink HCEs) and in Yahoo?
• Do Yahoo and Bing give approximately equivalent results for the same organisation title mention queries and for the same URL citation queries?
In the above, the search engine Yahoo was chosen as the only major search engine still (in January
2011) allowing link searches. Bing was chosen because, like Yahoo, it is a major search engine and
allows automatic query submission using an API. Google was not used because its search API
(Google, 2011) only allows searching of a user-specified list of web sites, which is inadequate for the
current study and for most webometric studies. Although the Google web interface can be used with
manual human submission of queries, this greatly adds to the time needed to conduct a study and
would also require copying the Google results for automatic processing for query splitting (see below)
and duplicate elimination, a task that does not seem to have been attempted previously in any
webometric study. Note also that at the time of data collection (January-February, 2011) Bing and
Yahoo were both owned by Microsoft so the comparison between them is not ideal. Nevertheless, the
results below show that they perform far from identically in the tests.
Methods
This research uses two academic case studies: large sites for HCEs and small sites for counts of
URLs. The large web sites data set is a collection of 131 UK universities, with names and URLs taken
from http://cybermetrics.wlv.ac.uk/database/university_lists.htm (December 23, 2010) and
subsequently checked. This gives a data set of relatively large academic web sites. The small web
sites data set is a collection of US library and information science departments, with names and URLs
taken from the 2009 US News rankings at http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-library-information-science-programs/library-information-science-rankings
(February 10, 2011). Both of these choices are relatively arbitrary but should give insights into the
answers to the research questions.
For the URL citation searches, the domain name (universities) or full URL (departments) was
placed in quotes, after excluding the initial http://, and used as the main query. Note that the searches
always exclude all matches from the site of the owning organisation, using the exclusion operator, -,
and the web site match command, site:. Hence, for the University of Wolverhampton the query would
be "wlv.ac.uk" -site:wlv.ac.uk. The searches were submitted automatically via the APIs
of Bing and Yahoo, using Webometric Analyst. Search engine APIs do not return results that are
identical to the web interface (McCown & Nelson, 2007), but they are broadly similar and it is the
APIs that are typically used for webometrics research and therefore are the most appropriate choice of
data source.
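A minimal sketch of this query construction, using a helper of our own devising (it is not part of Webometric Analyst), is:

    # Build a URL citation query: the URL (minus http://) as a phrase
    # search, with the owning site excluded via -site:.
    def url_citation_query(url):
        bare = url.replace("http://", "").rstrip("/")
        domain = bare.split("/")[0]
        return '"{0}" -site:{1}'.format(bare, domain)

    print(url_citation_query("http://wlv.ac.uk"))
    # "wlv.ac.uk" -site:wlv.ac.uk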
The organisation title searches were constructed as above, but with the organisation title in
quotes replacing the URL or domain name (e.g., "University of Wolverhampton" -site:wlv.ac.uk). The LIS departments typically had names that included multiple parts that
might not be mentioned together so a full title search was not useful. For these, the constituent parts
were separated and each part was placed in quotes. For instance, in the following example the
university name and department name were separated: "Louisiana State University"
"School of Library and Information Science" -site:lsu.edu site:edu.
For the LIS departments, site:edu was added to the end of each query so that all results came (almost)
exclusively from US educational institutions to ensure that the total number of matches did not reach
much above 1,000. This was a practical consideration to test the method effectively rather than a
generic recommended procedure. Query splitting was also used for any query with over 850 results to
look for additional matches. The query splitting technique produces additional matches above the
normal maximum of 1,000 but causes many additional queries to be submitted and probably produces
less complete lists of results than the results for a query returning less than 1,000 matches. If the
purpose of the paper was to estimate the impact of the departments as accurately as possible, rather
than to test the methods in a reasonable setting, then the site:edu part would not have been added and
the limitation of less complete results and additional queries to be submitted would have been
accepted. For the UK universities, site:ac.uk was not added because the analysis for these is based
upon HCEs so additional restriction was not necessary.
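The following sketch (again ours, not Webometric Analyst's code) illustrates the LIS department query construction, with each name part quoted separately, the owning site excluded, and the site:edu restriction appended:

    def title_mention_query(name_parts, own_domain, restriction="site:edu"):
        # Quote each name part separately so the parts need not be adjacent.
        phrases = " ".join('"{0}"'.format(part) for part in name_parts)
        return "{0} -site:{1} {2}".format(phrases, own_domain, restriction)

    print(title_mention_query(
        ["Louisiana State University",
         "School of Library and Information Science"],
        "lsu.edu"))
    # "Louisiana State University" "School of Library and Information
    # Science" -site:lsu.edu site:edu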
The inlink searches were constructed as for the URL citation searches except that the
linkdomain: command was used instead of quotes around the domain name for the universities (e.g.,
linkdomain:wlv.ac.uk -site:wlv.ac.uk) to get links to anywhere in the university web
site. For the LIS departments, the link: command was used instead of the linkdomain command to get
all links to just the home page (e.g., link:http://www.slis.ua.edu -site:ua.edu
site:edu). This ensured that the query did not get too many matches but meant that the results of
the link searches for LIS departments were not fully compatible with the other two types of searches,
which matched any reference to the department. This inlink restriction was made because some of the
departments did not have their own domain name and so the linkdomain: command could not be used
with them. This kind of limitation seems to be common in small webometric studies and so the option
of removing departments without their own domain name was not considered.
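As a sketch, the two inlink query forms used above can be built as follows (the helper names are ours; the query syntax mirrors the examples in the text):

    def university_inlink_query(domain):
        # linkdomain: finds links to anywhere in the site.
        return "linkdomain:{0} -site:{0}".format(domain)

    def department_inlink_query(home_url, parent_domain):
        # link: finds links to the home page only; site:edu keeps the
        # number of matches manageable, as discussed above.
        return "link:{0} -site:{1} site:edu".format(home_url, parent_domain)

    print(university_inlink_query("wlv.ac.uk"))
    # linkdomain:wlv.ac.uk -site:wlv.ac.uk
    print(department_inlink_query("http://www.slis.ua.edu", "ua.edu"))
    # link:http://www.slis.ua.edu -site:ua.edu site:edu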
A small methodological innovation has been introduced in this paper: the automatic inclusion
of multiple queries for the same organisation within the data collection software (Webometric
Analyst). Hence if an organisation has multiple names or web sites then these are automatically
queried separately and the results (HCEs or URL lists) are automatically combined (added or merged,
respectively). For instance the Graduate School of Library and Information Science at Queens
College, City University of New York has two alternative home page URLs:
http://qcpages.qc.edu/GSLIS/ and http://qcpages.qc.cuny.edu/GSLIS/. In order to get comprehensive
link counts, searches to both of them were calculated and the results combined. For URL lists,
Webometric Analyst combines the searches and eliminates duplicate URLs. For searches with over
1,000 matches, the query splitting used (see above) gets additional matches beyond the normal search
engine limit of 1,000, allowing the combining operation to proceed. Note that for HCEs the addition
of the results of different searches is problematic because duplicate elimination is not possible; this is
a limitation of the method. In Webometric Analyst, a pipe character | separates searches that are to be
submitted separately but have results that will be combined. Hence the input to Webometric Analyst
was the following for this query, all on a single line.
link:http://qcpages.qc.edu/GSLIS/ -site:qcpages.qc.edu/GSLIS/ -site:qcpages.qc.cuny.edu/GSLIS/ site:edu|link:http://qcpages.qc.cuny.edu/GSLIS/ -site:qcpages.qc.edu/GSLIS/ -site:qcpages.qc.cuny.edu/GSLIS/ site:edu
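The combining step can be sketched as follows; this is our illustration of the behaviour described above, not Webometric Analyst's actual code:

    def combine_hces(hces):
        # HCEs can only be added; duplicates cannot be eliminated, so
        # the combined figure may overcount pages matching both queries.
        return sum(hces)

    def combine_url_lists(url_lists):
        # Merge the result lists, keeping each URL once.
        seen, merged = set(), []
        for urls in url_lists:
            for url in urls:
                if url not in seen:
                    seen.add(url)
                    merged.append(url)
        return merged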
The same queries were used for Yahoo and Bing, via Webometric Analyst, except that the
link searches were not submitted to Bing because it does not support link searching. For the US LIS
searches, the exact number of matching URLs and domains were calculated with Webometric
Analyst's Reports function. The number of domains is calculated by extracting the domain name of
each matching URL and then counting the number of unique matching domain names. Domains were
counted because domain or site counts are frequently used in webometrics research instead of URL
counts.
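As a sketch, domain counting amounts to the following (the helper is ours):

    from urllib.parse import urlparse

    def count_domains(urls):
        # Extract the domain name of each matching URL and count the
        # number of unique domains.
        domains = set()
        for url in urls:
            netloc = urlparse(url).netloc or url.split("/")[0]
            domains.add(netloc.lower())
        return len(domains)

    print(count_domains(["http://a.edu/x", "http://a.edu/y", "http://b.edu/"]))  # 2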
Spearman correlations were used to assess the degree of agreement between results sets.
There is no gold standard to compare the results against but the correlations can be used to assess the
extent to which the different methods produce consistent results. Spearman correlations do not assess
the volume of matches from each method but this is ultimately less important than the correlation
because the latter reflects the rank order of the organisations. For each data set an external judgement
of one aspect of their performance was chosen in lieu of a gold standard, for a check of external
validity. For the US LIS departments, the 2009 US News rankings were chosen. These are based upon
peer evaluations but are somewhat teaching-focussed since the publication is aimed at new students.
For the UK, the Research Assessment Exercise (RAE) 2008 statistics were used to calculate a simple
research productivity index. The RAE effectively gives each submitted academic from each university
a score between 0 and 4 for their research quality. These scores were combined for each university
with the formula n1+2n2+3n3+4n4, where ni is the number of academics scoring rating i. This is an
oversimplification because a score of 4 could be worth many more times as much as four scores of 1
(e.g., in terms of research funding) but it would be difficult to justify any other specific weights and so
this simple approach has been adopted. The result for each university may be reasonably described as
an indicator of research productivity.
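For example, the index and the consistency check can be computed as follows (the counts and scores are invented for illustration):

    from scipy.stats import spearmanr

    def rae_index(n1, n2, n3, n4):
        # n1 + 2*n2 + 3*n3 + 4*n4, where ni is the number of academics
        # scoring rating i in RAE 2008.
        return n1 + 2 * n2 + 3 * n3 + 4 * n4

    print(rae_index(10, 40, 30, 20))  # 10 + 80 + 90 + 80 = 260

    # Hypothetical figures for four universities: the index compared by
    # rank with a web measure using a Spearman correlation.
    rae_scores = [260, 120, 515, 90]
    url_citation_hces = [3100, 900, 8800, 1500]
    rho, p = spearmanr(rae_scores, url_citation_hces)
    print(rho, p)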
Results
The two case studies are reported separately, starting with the US LIS departments.
URL counts and domain counts: The case of US LIS Departments
Tables 1 and 2 show the Spearman correlations for the web measures and the US News rankings of
the departments. The five unranked departments (from Southern Connecticut State University,
University of Denver, University of Puerto Rico, University of Southern Mississippi, Valdosta State
University) were given the joint lowest rank as these were assumed not to be highly ranked
departments. There is little difference overall between the URL counting measures and the domain
counting measures, perhaps because both search engines tend to filter out multiple results from the
same domain after the first two. The highest correlations are between the same measures from
different search engines: URL cites and titles. The lower but positive correlations between different
measures suggest that the three different measures are not interchangeable but have a large degree of
commonality.
Table 1. Spearman correlations for the five URL counting web measures and US News rankings for
the 49 US LIS departments*.

URL counting       Rank    Yahoo    Bing     Yahoo    Bing       Yahoo
                           inlinks  titles   titles   URL cites  URL cites
US News Rank       1.00
Yahoo inlinks     -0.62    1.00
Bing titles       -0.72    0.65     1.00
Yahoo titles      -0.74    0.68     0.96     1.00
Bing URL cites    -0.64    0.85     0.67     0.70     1.00
Yahoo URL cites   -0.65    0.87     0.77     0.78     0.95       1.00
* All correlations are significant with p = 0.000.
Table 2. Spearman correlations for the five domain counting web measures and US News rankings for
the 49 US LIS departments*.

Domain counting    Rank    Yahoo    Bing     Yahoo    Bing       Yahoo
                           inlinks  titles   titles   URL cites  URL cites
US News Rank       1.00
Yahoo inlinks     -0.64    1.00
Bing titles       -0.70    0.67     1.00
Yahoo titles      -0.73    0.69     0.97     1.00
Bing URL cites    -0.64    0.85     0.72     0.73     1.00
Yahoo URL cites   -0.67    0.89     0.77     0.78     0.98       1.00
* All correlations are significant with p = 0.000.
HCEs: The case of UK universities
Table 3 reports the correlations between the different HCEs for UK universities. These correlations
are sometimes much lower than the comparable correlations for US LIS departments. The Bing title
correlations are all low, suggesting that Bing should not be used for title search HCEs for large
organisations or for large web sites. The remaining four measures all correlate highly with each other
and with the RAE 2008 rankings, giving some confidence that they are reasonably interchangeable.
There are some significant outliers, however. For example, there is a particularly high value of
66,633,700 for the total of the searches "ex.ac.uk" -site:ex.ac.uk -site:exeter.ac.uk site:ac.uk and "exeter.ac.uk" -site:ex.ac.uk -site:exeter.ac.uk site:ac.uk. The search was run repeatedly, giving varied but similarly
high results and so it is not a one-off estimation anomaly but is related to the search itself. Some of the
results for the first searches did not match ex.ac.uk but matched sussex.ac.uk instead, with the
document parser splitting the word sussex before ex. This may account for the remaining anomalously
high values, especially because the domain sussex.ac.uk is not excluded from the searches. Adding -site:sussex.ac.uk to the searches gives 28,300,000 + 850,000 = 29,150,000 matches, which is still
anomalously high, however.
Table 3. Spearman correlations for the five HCE web measures and RAE rankings for the 131 UK
universities.
HCE                RAE     Yahoo    Bing     Yahoo    Bing       Yahoo
                   2008    inlinks  titles   titles   URL cites  URL cites
RAE 2008           1.00
Yahoo inlinks      0.86*   1.00
Bing titles        0.13    0.06     1.00
Yahoo titles       0.88*   0.91*    0.16     1.00
Bing URL cites     0.78*   0.86*    0.08     0.84*    1.00
Yahoo URL cites    0.88*   0.94*    0.03     0.94*    0.90*      1.00
* significant at p = 0.001 (other correlations not significant at p = 0.05)
Discussion and limitations
An important limitation of this research is that only one data set was used for each case and it is
possible that different conclusions would be drawn from different data sets, particularly if they were
associated with unusual use of links, URL citations or organisation titles. Hence it is recommended
that future webometrics impact research should routinely compare the results of different methods and
reject any method that gives inconsistent results for any reason.
A second limitation is that the results were only analysed quantitatively and therefore no
evidence was provided that the three methods (and the organisation title searches in particular) give
results that are meaningful in the sense that the source pages correctly mention the target organisation
in a positive context that makes the mentions worth counting. The significant correlations found with
offline measures (US News and RAE 2008) provide some evidence to support the value of the results,
however.
In terms of the accuracy of the results of the searches, there seem to be four different types of
problems, which are named and described in Table 4, together with advice for dealing with them.
Note that the names differ from some used previously, such as technical relevance (Bar-Ilan & Peritz,
2009), because of a focus on a different task. For instance a URL is technically relevant to a query if it
matches the query, whether or not the search engine returns it. This relates to the search engine
coverage issue in Table 4 because not all technically relevant URLs are returned for the typical query.
The issues in Table 4 all represent unavoidable limitations in webometrics research but in most cases
steps can be taken to minimise their impact.
Conclusions
The discussion above confirms that webometric studies have a number of unavoidable limitations in
the accuracy of the results. If highly accurate results are needed then these are unlikely to be obtained
from webometric methods. This limitation does not prevent webometric techniques from delivering
results that are useful in some contexts, however. If indicators are acceptable, such as metrics that are
known normally to correlate highly with relevant research measures, then the results suggest that all
three metrics discussed here (inlinks, URL citations and organisation title mentions) can be suitable
for research organisations. This was previously known for inlinks but not for the other two metrics.
For exact counting, the correlations between the three metrics for URL and domain counts
(Tables 1 and 2) are all high enough to support a case that they could be used interchangeably for
impact measurements, although there will be differences in their results. Moreover, the results of Bing
and Yahoo have high correlations, suggesting that these are almost equivalent in practice. Although
inlinks have been used primarily in webometric impact research, this suggests that URL citations and
organisation title mentions (e.g., title searches) are appropriate replacements if link searches are
withdrawn by Yahoo.
For HCEs, Bing is not recommended for title searches for large organisations or web sites
because it gave inconsistent results that disagree with all Yahoo results and with URL citations in
Bing. Bing also gave search engine retrieval anomalies for URL citation searches, although the results
overall were consistent in the sense of correlating highly with results from Yahoo.
One drawback with organisation title mentions is that the exact search to be used is ambiguous because organisations can be referred to by different equivalent names, and sub-organisations may be referred to in even more varied ways in conjunction with the parent organisation (the query polysemy issue discussed above). Hence manual labour is needed to identify
not only common name variations but also effective combinations of searches to capture the most
common ways in which an organisation is described online. At the level of analysing results extra
work is not needed, however, since Webometric Analyst has a facility to automatically combine the
results of multiple searches and also to eliminate duplicates if URL or domain counting is used. This
makes organisation title mentions a practical impact metric for webometrics.
Table 4. Sources of inaccuracy in search engine results

Search engine coverage. Search engines cover only a fraction of the web and this fraction varies between search engines.
   Example: The absolute number of results varies substantially between the search engines used. All results reported in this paper are probably underestimates of the number of matching web pages.
   Practical implications: The results should never be claimed to represent the "whole web". The goal of webometric studies should be to get the relative numbers of results between queries consistent, rather than to get all results.

Search engine retrieval variation. Due to spam filtering (and estimation algorithms for HCEs) there will be variations in the extent to which the results retrieved from a search engine match the results known to it.
   Example: The variation from a straight line in the data may be partly due to retrieval variation.
   Practical implications: Use triangulation with other methods or search engines to decide if the variations are too large to use the data (e.g., Bing title HCEs). If exact ranking is important, consider combining multiple webometric sources in an appropriate linear combination to smooth out variations.

Search engine retrieval anomalies. Due to text processing algorithm configurations, some queries may return many incorrect matches.
   Example: Sussex.ac.uk being mistaken for suss ex.ac.uk gave an anomaly in the ex.ac.uk results.
   Practical implications: When possible, multiple methods or search engines should be triangulated together to help identify anomalies and select an anomaly-free search engine and method.

Query polysemy. A query may match irrelevant documents.
   Example: The query "Oxford University" matches Oxford University Press. UCL matches University College London and Université Catholique de Louvain and is a commonly-used abbreviation for both.
   Practical implications: The researcher should try to construct one or more queries that match most relevant documents and match few irrelevant documents. If this is impossible, manual filtering of results may be needed or queries with low coverage used, in conjunction with a disclaimer.
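The linear combination advice in Table 4 could, for instance, be implemented by rescaling each source to a common range and averaging. The following sketch, with arbitrary equal weights and invented counts, is one possible reading of that advice rather than a procedure from this paper:

    def min_max(xs):
        # Rescale a list of counts to [0, 1].
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    def combined_score(sources, weights=None):
        # sources: one list of counts per engine/method, in the same
        # organisation order; returns a smoothed combined score.
        scaled = [min_max(s) for s in sources]
        weights = weights or [1.0 / len(sources)] * len(sources)
        n = len(sources[0])
        return [sum(w * s[i] for w, s in zip(weights, scaled)) for i in range(n)]

    yahoo_inlinks = [120, 45, 300]   # hypothetical counts
    bing_url_cites = [90, 60, 250]
    print(combined_score([yahoo_inlinks, bing_url_cites]))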
References
Aguillo, I. F. (2009). Measuring the institution's footprint in the web. Library Hi Tech, 27(4), 540-556.
Aguillo, I. F., Granadino, B., Ortega, J. L., & Prieto, J. A. (2006). Scientific research activity and
communication measured with cybermetrics indicators. Journal of the American Society for
Information Science and Technology, 57(10), 1296-1302.
An, L., & Qiu, J. P. (2004). Research on the relationships between Chinese journal impact factors and
external web link counts and web impact factors. Journal of Academic Librarianship, 30(3),
199-204.
Bar-Ilan, J. (1999). Search engine results over time - a case study on search engine stability.
Cybermetrics. Retrieved January 26, 2006, from
http://www.cindoc.csic.es/cybermetrics/articles/v2i1p1.html
Bar-Ilan, J. (2001). Data collection methods on the Web for informetric purposes - A review and
analysis. Scientometrics, 50(1), 7-32.
Bar-Ilan, J. (2005). What do we know about links and linking? A framework for studying links in
academic environments. Information Processing & Management, 41(4), 973-986.
Bar-Ilan, J., & Peritz, B. C. (2008). The lifespan of 'informetrics' on the Web: An eight year study
(1998-2006). Scientometrics, 79(1), 7-25.
Bar-Ilan, J., & Peritz, B. C. (2009). A method for measuring the evolution of a topic on the Web: The
case of “Informetrics”. Journal of the American Society for Information Science and
Technology, 60(9), 1730-1740.
Barjak, F., Li, X., & Thelwall, M. (2007). Which factors explain the web impact of scientists' personal
home pages? Journal of the American Society for Information Science and Technology, 58(2),
200-211.
Barjak, F., & Thelwall, M. (2008). A statistical analysis of the web presences of European life
sciences research teams. Journal of the American Society for Information Science and
Technology, 59(4), 628-643.
BBC. (2009). Profiles: Yahoo, Google and Microsoft. BBC News, Retrieved February 28, 2011 from:
http://news.bbc.co.uk/2/hi/business/7222431.stm.
Björneborn, L., & Ingwersen, P. (2004). Toward a basic framework for webometrics. Journal of the
American Society for Information Science and Technology, 55(14), 1216-1227.
Bornmann, L., & Daniel, H.-D. (2006). Selecting scientific excellence through committee peer review
– A citation analysis of publications previously published to approval or rejection of post-doctoral research fellowship applicants. Scientometrics, 68(3), 427-440.
Bornmann, L., & Daniel, H.-D. (2008). What do citation counts measure? A review of studies on
citing behavior. Journal of Documentation, 64(1), 45-80.
Brin, S., & Page, L. (1998). The anatomy of a large scale hypertextual Web search engine. Computer
Networks and ISDN Systems, 30(1-7), 107-117.
Chen, C., Newman, J., Newman, R., & Rada, R. (1998). How did university departments interweave
the Web: A study of connectivity and underlying factors. Interacting With Computers, 10(4),
353-373.
Cothey, V. (2004). Web-crawling reliability. Journal of the American Society for Information Science
and Technology, 55(14), 1228-1238.
Cronin, B. (2001). Bibliometrics and beyond: some thoughts on web-based citation analysis. Journal
of Information Science, 27(1), 1-7.
Cronin, B., & Shaw, D. (2002). Banking (on) different forms of symbolic capital. Journal of the
American Society for the Information Science, 53(13), 1267-1270.
Cronin, B., Snyder, H. W., Rosenbaum, H., Martinson, A., & Callahan, E. (1998). Invoked on the
web. Journal of the American Society for Information Science, 49(14), 1319-1328.
Garfield, E. (2005). The agony and the ecstasy: The history and the meaning of the journal impact
factor. Fifth International Congress on Peer Review in Biomedical Publication, in Chicago,
USA, Retrieved September 27, 2007:
http://garfield.library.upenn.edu/papers/jifchicago2005.pdf.
Gingras, Y., & Wallace, M. L. (2010). Why it has become more difficult to predict Nobel Prize
winners: A bibliometric analysis of nominees and winners of the chemistry and physics prizes
(1901-2007). Scientometrics, 82(2), 401-412.
Gomes, B., & Smith, B. T. (2003). Detecting query-specific duplicate documents. United States
Patent 6,615,209, Available: http://www.patents.com/Detecting-query-specific-duplicate-documents/US6615209/en-US/.
Google (2011). JSON/Atom Custom Search API, Retrieved April 8, 2011 from:
http://code.google.com/apis/customsearch/v1/overview.html
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kousha, K., & Thelwall, M. (2006). Motivations for URL citations to open access library and
information science articles. Scientometrics, 68(3), 501-517.
Kousha, K., & Thelwall, M. (2007a). Google Scholar citations and Google Web/URL citations: A
multi-discipline exploratory analysis. Journal of the American Society for Information
Science and Technology, 58(7), 1055 -1065.
Kousha, K., & Thelwall, M. (2007b). How is science cited on the web? A classification of Google
unique web citations. Journal of the American Society for Information Science and
Technology, 58(11), 1631-1644.
Kousha, K., & Thelwall, M. (2008). Assessing the impact of disciplinary research on teaching: An
automatic analysis of online syllabuses. Journal of the American Society for Information
Science and Technology, 59(13), 2060-2069.
Kousha, K., & Thelwall, M. (2009). Google Book Search: Citation analysis for social science and the
humanities. Journal of the American Society for Information Science and Technology, 60(8),
1537-1549.
Kousha, K., Thelwall, M., & Rezaie, S. (2010). Using the Web for research evaluation: The Integrated
Online Impact indicator. Journal of Informetrics, 4(1), 124-135.
Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740), 107-109.
Lewandowski, D., Wahlig, H., & Meyer-Bautor, G. (2006). The freshness of web search engine
databases. Journal of Information Science, 32(2), 131-148.
Li, X., Thelwall, M., Wilkinson, D., & Musgrove, P. B. (2005a). National and international university
departmental web site interlinking, part 1: Validation of departmental link analysis.
Scientometrics, 64(2), 151-185.
Li, X., Thelwall, M., Wilkinson, D., & Musgrove, P. B. (2005b). National and international university
departmental web site interlinking, part 2: Link patterns. Scientometrics, 64(2), 187-208.
MacRoberts, M. H., & MacRoberts, B. R. (1996). Problems of citation analysis. Scientometrics,
36(3), 435-444.
McCown, F., & Nelson, M. L. (2007). Agreeing to disagree: Search engines and their public
interfaces. Joint Conference on Digital Libraries, Retrieved June 18, 2007 from:
http://www.cs.odu.edu/~fmccown/pubs/se-apis-jcdl2007.pdf.
Meho, L. I., & Sonnenwald, D. H. (2000). Citation ranking versus peer evaluation of senior faculty
research performance: A case study of Kurdish scholarship. Journal of the American Society for
Information Science, 51(2), 123-138.
Merton, R. K. (1973). The sociology of science. Theoretical and empirical investigations. Chicago:
University of Chicago Press.
Mettrop, W., & Nieuwenhuysen, P. (2001). Internet search engines - fluctuations in document
accessibility. Journal of Documentation, 57(5), 623-651.
Moed, H. F. (2005). Citation analysis in research evaluation. New York: Springer.
Oppenheim, C. (1995). The correlation between citation counts and the 1992 Research Assessment
Exercise ratings for British library and information science university departments. Journal of
Documentation, 51, 18-27.
Oppenheim, C. (1997). The correlation between citation counts and the 1992 Research Assessment
Exercise ratings for British research in Genetics, Anatomy and Archaeology. Journal of
Documentation, 53, 477-487.
Qiu, J. P., Chen, J. Q., & Wang, Z. (2004). An analysis of backlink counts and Web Impact Factors
for Chinese university websites. Scientometrics, 60(3), 463-473.
Rousseau, R. (1999). Daily time series of common single word searches in AltaVista and
NorthernLight. Cybermetrics, 2/3. Retrieved July 25, 2006 from:
http://www.cindoc.csic.es/cybermetrics/articles/v2i1p2.html.
Seglen, P. O. (1998). Citation rates and journal impact factors are not suitable for evaluation of
research. ACTA Orthopaedica Scandinavica, 69(3), 224-229.
Seidman, E. (2007). We are flattered, but... Bing Community, Retrieved November 11, 2009 from:
http://www.bing.com/community/blogs/search/archive/2007/03/28/we-are-flattered-but.aspx.
Smith, A., & Eysenck, M. (2002). The correlation between RAE ratings and citation counts in
psychology. Retrieved April 14, 2005 from: http://psyserver.pc.rhbnc.ac.uk/citations.pdf.
Smith, A. G. (1999). A tale of two web spaces; comparing sites using Web Impact Factors. Journal of
Documentation, 55(5), 577-592.
Stuart, D., & Thelwall, M. (2006). Investigating triple helix relationships using URL citations: A case
study of the UK West Midlands automobile industry. Research Evaluation, 15(2), 97-106.
Thelwall, M. (2002a). Conceptualizing documentation on the Web: An evaluation of different
heuristic-based models for counting links between university web sites. Journal of American
Society for Information Science and Technology, 53(12), 995-1005.
Thelwall, M. (2002b). The top 100 linked pages on UK university Web sites: high inlink counts are
not usually directly associated with quality scholarly content. Journal of Information Science,
28(6), 485-493.
Thelwall, M. (2004). Methods for reporting on the targets of links from national systems of university
Web sites. Information Processing and Management, 40(1), 125-144.
Thelwall, M. (2008a). Extracting accurate and complete results from search engines: Case study
Windows Live. Journal of the American Society for Information Science and Technology,
59(1), 38-50.
Thelwall, M. (2008b). Quantitative comparisons of search engine results. Journal of the American
Society for Information Science and Technology, 59(11), 1702-1710.
Thelwall, M. (2011, in press). A comparison of link and URL citation counting. ASLIB Proceedings.
Retrieved February 28 from: http://cybermetrics.wlv.ac.uk/paperdata/LinkURLcitationPreprint.doc.
Thelwall, M., & Harries, G. (2004). Do the Web sites of higher rated scholars have significantly more
online impact? Journal of the American Society for Information Science and Technology,
55(2), 149-159.
Thelwall, M., & Kousha, K. (2008). Online presentations as a source of scientific impact?: An
analysis of PowerPoint files citing academic journals. Journal of the American Society for
Information Science and Technology, 59(5), 805-815.
Thelwall, M., & Zuccala, A. (2008). A university-centred European Union link analysis.
Scientometrics, 75(3), 407-420.
Uyar, A. (2009a). Google stemming mechanisms. Journal of Information Science, 35(5), 499-514.
Uyar, A. (2009b). Investigation of the accuracy of search engine hit counts. Journal of Information
Science, 35(4), 469-480.
van Raan, A. F. J. (2000). The Pandora's box of citation analysis: Measuring scientific excellence-The
last evil? In B. Cronin & H. B. Atkins (Eds.), The Web of Knowledge: A Festschrift in Honor
of Eugene Garfield (pp. 301-319). Medford, NJ: Information Today, Inc. ASIS Monograph
Series.
Vaughan, L. (2005). Exploring website features for business information. Scientometrics, 61(3), 467-477.
Vaughan, L., & Hysen, K. (2002). Relationship between links to journal Web sites and impact factors.
ASLIB Proceedings, 54(6), 356-361.
Vaughan, L., & Shaw, D. (2003). Bibliographic and Web citations: What is the difference? Journal of
the American Society for Information Science and Technology, 54(14), 1313-1322.
Vaughan, L., & Shaw, D. (2005). Web citation data for impact assessment: A comparison of four
science disciplines. Journal of the American Society for Information Science & Technology,
56(10), 1075-1087.
Vaughan, L., & Wu, G. (2004). Links to commercial websites as a source of business information.
Scientometrics, 60(3), 487-496.
Vaughan, L., & You, J. (2010). Word co-occurrences on Webpages as a measure of the relatedness of
organizations: A new Webometrics concept. Journal of Informetrics, 4(4), 483–491.
Wilkinson, D., Harries, G., Thelwall, M., & Price, E. (2003). Motivations for academic Web site
interlinking: Evidence for the Web as a novel source of information on informal scholarly
communication. Journal of Information Science, 29(1), 49-56.
Yahoo (2011a). When will the change happen? How long will the transition take? Retrieved April 9,
2011 from: http://help.yahoo.com/l/us/yahoo/search/alliance/alliance-2.html
Yahoo (2011b). Web Search APIs from Yahoo! Search. Retrieved April 13, 2011 from:
http://developer.yahoo.com/search/web/webSearch.html
Zuccala, A., Thelwall, M., Oppenheim, C., & Dhiensa, R. (2007). Web intelligence analyses of digital
libraries: A case study of the National electronic Library for Health (NeLH). Journal of
Documentation, 64(3), 558-589.