Digging Deep for Hidden Information in the Web Part 2: Automated hyperlink

Digging Deep for Hidden Information in the Web
Part 1: Automated blog analysis
Part 2: Automated hyperlink analysis
Part 1 Automated Blog Analysis
Analysing Public Science Debates
through Blogs and Online News
Sources
Part 1 Contents
Background



Blogs
Online news sources
RSS
Tracking public science debates
Detecting public science debates
Background
Blogs, public opinion, online news,
RSS
Background
There are millions of bloggers
Bloggers are almost normal human
beings
Automatically tracking bloggers’
postings may give insights into public
opinion
Blog tracking companies
IBM

WebFountain
Intelliseek


BlogPulse
“Monitor, measure and leverage consumer-generated media”
Others growing…
RSS Format
Rich Site Syndication/Really Simple
Syndication


XML technology
Used for frequently updated information sources
(blogs, news, academic journals)
RSS Readers



Users subscribe to the RSS feeds of favourite
blogs/sites/journals/searches
Notified when updates available
User-controlled ‘push’ technology
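As a rough illustration of what an RSS reader works with, the sketch below parses a small RSS 2.0 document using only Python's standard library and picks out the items a subscriber has not yet seen. The feed content, titles and URLs are invented for the example, and the function names are assumptions rather than part of any particular reader.

```python
import xml.etree.ElementTree as ET

# An invented RSS 2.0 feed; a real reader would download this from the blog.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example science blog</title>
    <item>
      <title>Stem cell research update</title>
      <link>http://example.org/posts/1</link>
      <pubDate>Mon, 05 Jun 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>GM food labelling debate</title>
      <link>http://example.org/posts/2</link>
      <pubDate>Tue, 06 Jun 2006 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dictionary per <item> element in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
        })
    return items


def new_items(items, seen_links):
    """Keep only the items whose link the user has not seen before."""
    return [item for item in items if item["link"] not in seen_links]


items = parse_items(SAMPLE_FEED)
unseen = new_items(items, {"http://example.org/posts/1"})
```

The "notification" step of a reader amounts to repeating this check each time the feed is fetched again.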
Tracking Public Science Debates
Blog keyword searches
Technorati “Searches weblogs by keyword
and for links”

Stem cell research
Blogdigger

stem cell research
IceRocket


Allows Advanced searches
Allows genuine date range search (Google only
allows “last updated” date range searches)
Track evolution over time
What is changing about interest in Stem cell
research/GM food?
Are experts good at identifying changes in
public interest?
How can experts be sure/can they be
supported with quantitative information?
Can blogs be used to generate time series
reflecting changes in “public interest”?
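One way such a time series could be built, assuming each posting is available as a (date, text) pair: for each day, compute the share of postings that mention a keyword. The sample postings are invented, and the function name is hypothetical.

```python
from collections import defaultdict


def keyword_time_series(posts, keyword):
    """Return {date: proportion of that day's postings containing keyword}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = keyword.lower()
    for date, text in posts:
        totals[date] += 1
        if needle in text.lower():
            hits[date] += 1
    # Dates are ISO strings, so sorting them puts the series in time order.
    return {date: hits[date] / totals[date] for date in sorted(totals)}


# Invented sample data for illustration only.
posts = [
    ("2006-06-01", "Holiday photos"),
    ("2006-06-01", "Thoughts on stem cell research funding"),
    ("2006-06-02", "New stem cell research vote announced"),
    ("2006-06-02", "Stem cell research: what next?"),
]
series = keyword_time_series(posts, "stem cell research")
```

Dividing by the number of postings per day, rather than using raw counts, is a design choice that keeps the series comparable when the overall volume of blogging changes.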
Free science debate graphs
Solves the trend identification problem?
BlogPulse offers free automatic blog
searches and keyword-generated clicksearch graphs



Stem cell research
GM food
Mobile phone radiation
Research graphs
Time-consuming to collect data
Give control over the data source
Detecting Public Science Debates
How to detect a new debate?
Heuristic methods

E.g. Read papers, scan relevant blogs
Automatic methods

E.g. look for sudden increase in usage of
science-related words in blogs?
Free hot topic searches
Blog keyword search (sort by date)

Technorati “Searches weblogs by keyword and for
links”
 Stem cell research

Blogdigger blog search
Hot topic searches


Blogdex – top contagious information
Bloglines – today’s hot topics (most popular links)
Searches find the really big science debates?
Specialist research tools
Commercial software

Intelliseek/IBM
Mozdeh RSS monitor




Generates sub-collections
Generates word time series
Allows keyword searches
Identifies hot topics
Mozdeh Science Concern Corpus
A collection of blog postings containing
a fear word AND a science word
Trend detection used to identify hot
“science fear” topics
Data cleaning to remove spam
Need manual scanning of list of words
experiencing biggest usage increase
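The selection rule and the ranking step can be sketched as follows. This is not Mozdeh's actual code: the word lists, the tokeniser and the sample postings are all assumptions, and the "increase" here is simply the largest day-on-day rise in the share of postings that use a word.

```python
import re

FEAR_WORDS = {"fear", "worry", "risk", "concern"}      # hypothetical list
SCIENCE_WORDS = {"science", "research", "scientist"}   # hypothetical list


def words(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_science_concern(text):
    """True if the posting has at least one fear word AND one science word."""
    found = words(text)
    return bool(found & FEAR_WORDS) and bool(found & SCIENCE_WORDS)


def max_daily_increase(daily_postings, word):
    """Largest rise between consecutive days in the share of postings using word.

    daily_postings is a list of days, each day being a list of posting texts.
    """
    shares = []
    for day in daily_postings:
        share = sum(word in words(p) for p in day) / len(day) if day else 0.0
        shares.append(share)
    rises = [later - earlier for earlier, later in zip(shares, shares[1:])]
    return max(rises, default=0.0)


# Invented two-day sample: every posting passes the corpus filter.
days = [
    ["fear of research on stem cells", "risk in science"],
    ["stem research fear", "stem science worry"],
]
corpus = [[p for p in day if is_science_concern(p)] for day in days]
stem_rise = max_daily_increase(corpus, "stem")
```

Ranking every word by this score produces the candidate list that then has to be scanned manually.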
Classification of top 5 words
Word        Max. daily increase (feeds)   Classification
stem        19%                           Science fear (stem cell research)
orlean      16%                           Information (about hurricane)
hurricane   16%                           Duplicate of ‘orlean’
katrina     15%                           Duplicate of ‘orlean’
june        14%                           Temporal descriptor
Classification of top 200 words
[Bar chart: number of the top 200 words in each category, on a scale of 0 to 80. Categories: Random, Temporal Descriptor, Duplicate, Other, Threat Prediction, Progress, Information, Fear of Science.]
Chart annotations: the words come from multiple stories; hot science fear words; e.g. new medical cure.
7.5% of the top 200 words represent new public fears of science stories
Unexpected results?
Social science research

Sudden burst of discussion over fears of the
economic theories of Karl Rove, an influential
advisor to George Bush
Computer security


Concern over spyware features in a software
vendor’s products
Research showing that consumers’ pin numbers
could be revealed by poor printing
Conclusions
Many free tools support exploration of
Consumer Generated Media
Also room for specialist research tools
References
http://www.blogpulse.com/
http://www.blogpulse.com/www2006workshop/
http://www.creen.org/

Thelwall, M., Prabowo, R. & Fairclough, R.
(2006, to appear). Are raw RSS feeds suitable
for broad issue scanning? A science concern
case study. Journal of the American Society for
Information Science and Technology.
Acknowledgement
The work was supported by a European
Union grant for activity code NEST-2003-Path-1.
It is part of the CREEN
project (Critical Events in Evolving
Networks, contract 012684,
http://www.creen.org/)
Part 2: Automated hyperlink
analysis
Link analysis as a social science
technique
Link Analysis Manifesto
Links are:


A wonderful new source of information
about relationships between people,
organisations and information
An easy to collect data source
But:

Results should be interpreted with care
Part 2 Contents
Academic link analysis – mainly from an
information science perspective
A general social science link analysis
methodology
Commercial applications
Why Count Links?
Individual hyperlinks may reflect connections
between web page contents or creators
Counts of large numbers of hyperlinks may
reflect wider underlying social processes
Links may reflect phenomena that have
previously been difficult to study

E.g. informal scholarly communication
Why Count University Links?
To map patterns of communication between
researchers in a country
Which universities collaborate a lot?
Which universities collaborate with
government or industry?
Which universities are using the web
effectively?
Counting links
Search engines will count them for you!
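Yahoo! advanced queries, e.g. links between the University of
Wolverhampton (wlv.ac.uk) and Oxford University (ox.ac.uk), in either direction

domain:ox.ac.uk AND linkdomain:wlv.ac.uk (pages at Oxford that link to Wolverhampton)
domain:wlv.ac.uk AND linkdomain:ox.ac.uk (pages at Wolverhampton that link to Oxford)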
Google link queries
Find links to specific URLs, e.g. links to the
University home page
link:www.wlv.ac.uk
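The Yahoo! domain/linkdomain queries above follow one pattern, so a small helper can generate them for every ordered pair in a list of university domains. The function names are made up for this sketch; it only builds the query text shown on the slide and does not contact any search engine.

```python
from itertools import permutations


def link_query(source_domain, target_domain):
    """Query text for pages in source_domain that link to target_domain."""
    return f"domain:{source_domain} AND linkdomain:{target_domain}"


def all_interlink_queries(domains):
    """One query per ordered (source, target) pair of distinct domains."""
    return {
        (source, target): link_query(source, target)
        for source, target in permutations(domains, 2)
    }


queries = all_interlink_queries(["wlv.ac.uk", "ox.ac.uk"])
```

For n universities this gives n × (n − 1) queries, which is why the approach scales to whole national university systems only when the queries are generated automatically.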

Counting links
Can use a special purpose web crawler
or robot



Visits all the pages in a web site
Counts the links in the site
Can use “advanced” counting methods
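A minimal sketch of the counting step of such a crawler, using only Python's standard library. It works on pages that have already been downloaded (here two invented pages held in a dictionary) rather than fetching anything. It also illustrates one possible "advanced" counting choice: counting each source page at most once per target host instead of counting every individual link. That choice, and all the names below, are assumptions for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def count_outlinks(pages, site_domain):
    """pages maps URL to HTML. Returns two Counters keyed by target host:
    the number of links, and the number of source pages with such a link."""
    by_link = Counter()
    by_page = Counter()
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        hosts = [urlparse(link).hostname or "" for link in parser.links]
        # Simple suffix test to drop links that stay inside the crawled site.
        external = [h for h in hosts if h and not h.endswith(site_domain)]
        by_link.update(external)
        by_page.update(set(external))
    return by_link, by_page


# Invented pages standing in for a crawled site.
pages = {
    "http://www.wlv.ac.uk/a.html": (
        '<a href="http://www.ox.ac.uk/">Oxford</a> '
        '<a href="http://www.ox.ac.uk/research">Research</a> '
        '<a href="/b.html">Internal</a>'
    ),
    "http://www.wlv.ac.uk/b.html": '<a href="http://www.ox.ac.uk/">Oxford</a>',
}
by_link, by_page = count_outlinks(pages, "wlv.ac.uk")
```

Counting by page rather than by link reduces the influence of a single page that repeats the same link many times.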
Some Inter-University
Hyperlink Patterns
Mainly for the UK and Europe
Links to UK universities against
their research productivity
The reason for the
strong correlation is
the quantity of Web
publication, not its
quality
This is different to
citation analysis
Most links are only loosely
related to research
90% of links between UK university sites
have some connection with scholarly activity,
including teaching and research

But less than 1% are equivalent to citations
So link counts do not measure research
dissemination but are more a natural byproduct of scholarly activity


Cannot use link counts to assess research
Can use link counts to track an aspect of
communication
UK universities tend to link to
their neighbours
Universities
cluster
geographically
Language is a factor in
international interlinking
English is the dominant language for Web sites
in the Western EU
In a typical country, 50% of pages are in the
national language(s) and 50% in English
Non-English-speaking countries interlink extensively in
English
Language     Total university Web pages
English      12,379,256
German        2,888,072
Spanish       1,094,442
Swedish       1,008,353
Dutch           962,092
Greek           941,420
French          885,432
Italian         488,172
Norwegian       458,961
Finnish         444,974
Portuguese      172,804
Danish           86,107
Others          328,644
University Web page languages
[Stacked bar chart: percentage (0% to 100%) of university Web pages in each language (English, German, Swedish, Dutch, French, Others) by country: fr, it, de, es, gr, no, nl, pt, ch, be, dk, at, se, uk, ie, fi.]
Patterns of international
communication
Counts of links between EU universities in Swedish are represented by arrow thickness.
Counts of links between EU universities in French are represented by arrow thickness.
Which
language???
Which
language???
Which
language?
Who is
isolated?
International link patterns
The next slide is a (Kamada-Kawai)
network of the interlinking of the “top”
5 universities in AEAN countries (Asia
and Europe). Arrows represent
at least 100 links, and unconnected
universities have been removed.
The rich get richer on the web

Link creation obeys the ‘rich get richer’ law
 Sites which already have a lot of links attract
the most new links
 Some sites have a huge number of links: most
have one or none
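A toy simulation can illustrate the "rich get richer" idea. The sketch below is an assumed, simplified model and is not fitted to any data in this talk: sites join one at a time, and each new site creates a single link to an existing site chosen with probability proportional to that site's current inlinks plus one (the "plus one" lets unlinked sites be chosen too).

```python
import random


def simulate_inlinks(num_sites=1000, seed=1):
    """Return a list with the final inlink count of each simulated site."""
    rng = random.Random(seed)
    inlinks = [0]   # the first site starts with no inlinks
    tickets = [0]   # site i appears (inlinks[i] + 1) times in this list
    for new_site in range(1, num_sites):
        # Drawing uniformly from the tickets favours well-linked sites.
        target = rng.choice(tickets)
        inlinks[target] += 1
        tickets.append(target)
        # The newcomer joins with zero inlinks and one ticket of its own.
        inlinks.append(0)
        tickets.append(new_site)
    return inlinks


counts = simulate_inlinks()
share_low = sum(1 for c in counts if c <= 1) / len(counts)
```

In runs of this model a few early sites tend to accumulate many inlinks while most sites end with one or none, which is the qualitative pattern described above.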
Rich get richer example: Links from
Australian university pages
The anomalies
are also
interesting
Part 3: A General Social
Science Link Analysis
Methodology
A general framework for using link counts in
social sciences research


For research into link creation or
Together with other sources, for research into
other online or offline phenomena
Applicable when there are enough links
relevant to the research question to count


For collections of large web sites or
For large collections of small web sites
Nine stages for a research
project
1. Formulate an appropriate research
question, taking into account existing
knowledge of web structure
2. Conduct a pilot study
3. Identify web pages or sites that are
appropriate to address the research
question
4. Collect link data from a commercial
search engine or a personal crawler, taking
appropriate accuracy safeguards
5. Apply data cleansing techniques to the
links, if possible, and select an appropriate
counting method
6. Partially validate the link count results
through correlation tests, if possible
7. Partially validate the interpretation of the results
through a link classification exercise
8. Report results with an interpretation consistent
with link classification exercise, including either
a detailed description of the classification or
exemplars to illustrate the categories
9. Report the limitations of the study and
parameters used in data collection and
processing
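Stage 6 can be illustrated with a rank correlation between link counts and an independent measure, such as a research productivity score. The sketch below computes Spearman's coefficient from average ranks using the standard library only. The five data points are invented, and the helper names are assumptions.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented figures for five hypothetical universities.
inlinks = [120, 340, 90, 800, 410]
productivity = [1.2, 3.5, 1.0, 4.1, 2.9]
rho = spearman(inlinks, productivity)
```

A rank-based coefficient is a reasonable choice here because link counts are typically highly skewed. As the stage itself says, a significant correlation only partially validates the link counts.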
The theoretical perspective for
link counting
In order to be able to reliably interpret link
counts, all links should be created



individually and independently,
by humans,
through equivalent gravity judgments (e.g., about
the quality of the information in the target page).
Additionally, links to a site should target pages
created by the site owner or somebody else closely
associated with the site.
Commercial applications
Of link analysis
Commercial applications
Find out who links to your web site


More links mean more visitors
Check if your web site is being recognised
Find out who isn’t linking to your site


But is linking to a competitor’s web site!
Gives ideas about where to get new
customers or links from
Takes an hour of advanced searches

Simple but very valuable!
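Once the two lists of linking sites have been collected by hand from the advanced searches, the comparison itself is a set difference. The domain names below are placeholders and the function name is hypothetical.

```python
def link_prospects(links_to_me, links_to_competitor):
    """Sites that link to the competitor but not to your own site."""
    return sorted(set(links_to_competitor) - set(links_to_me))


# Placeholder domains for illustration only.
links_to_me = ["directory.example", "news.example"]
links_to_competitor = ["directory.example", "reviews.example", "forum.example"]
prospects = link_prospects(links_to_me, links_to_competitor)
```

The resulting list is a starting point for deciding which sites to approach for links or customers.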
Conclusion
There is a lot of hidden
information in the web: in blogs
and hyperlinks
Co-authors
Ray Binns, Viv Cothey, Ruth Fairclough, Gareth
Harries, Xuemei Li, Peter Musgrove, Teresa Page-Kennedy,
Nigel Payne, Rudy Prabowo, Liz Price,
David Stuart, David Wilkinson, Alesia Zuccala,
University of Wolverhampton.
Rong Tang, Catholic University of America.
Han-Woo Park, YeungNam University, South Korea.
Paul Wouters, Andrea Scharnhorst. The Virtual
Knowledge Studio for the Humanities and Social
Sciences, Amsterdam, The Netherlands.