Analytics and Access to the UK Web Archive

Access and Analytics to the UK Web Archive
Lewis Crawford,
Web Archive Technical Lead
The British Library
Introduction
This talk will cover:

Background of the UK Web Archive

Traditional access methods to Web Archives

Full text search for resource discovery

Problems of scale – needles and haystacks
Web Archiving: the basics

What
Selecting, capturing, storing, preserving and managing access to
snapshots of websites over time

How
Use crawler software to download websites automatically
Selective or domain archiving
Provide access in a Web Archive

When
Since mid 1990s

Who
Heritage and memory organisations, eg IIPC members
University libraries
Not-for-profit and commercial organisations, eg Internet Archive
Individual researchers

Why
Global information resource
Artefact of cultural and technology change
Representative sample of the web: historical and sociological data that
may not be found elsewhere
Part of national digital heritage - legal requirements
UK Web Archive:
Web archive as historical documents
Multimedia based content
3D visualisation wall
Full text search
N-gram visualisation
Media based results
Semantic analysis
Scale: needle and haystack
Subject hierarchy visualisation

UK Web Archive
~ 10,000 websites collected since 2004
~ 40,000 instances

Google: “seen 1 trillion unique URLs”

More than a billion new pages are added to the web every day

The UK web domain
9 million .uk domain names registered in December 2010
~ 1 million using other domain names
Growing at 11% - 14% per year
40% estimated to be in scope for Legal Deposit
Estimated ~110TB each UK domain crawl
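
As a rough illustration of the growth figures above, the sketch below compounds the December 2010 count of 9 million .uk names at the quoted 11% and 14% annual rates. The function name and the five-year horizon are assumptions chosen for illustration; the projected values are not figures from the talk.

```python
def project_domains(start: float, annual_rate: float, years: int) -> float:
    """Compound `start` by `annual_rate` once per year for `years` years."""
    return start * (1 + annual_rate) ** years


UK_DOMAINS_DEC_2010 = 9_000_000  # figure quoted on the slide

# Horizon of 5 years is an arbitrary illustrative choice.
for rate in (0.11, 0.14):
    projected = project_domains(UK_DOMAINS_DEC_2010, rate, years=5)
    print(f"{rate:.0%} per year: {projected / 1e6:.1f} million names after 5 years")
```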
The value of the haystacks – content visualisation
Big Data analytics

Java MapReduce jobs use Tika to extract text and generate XML files for Solr ingest

Hive & Pig for ad hoc query analysis
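
The extract-and-generate step can be sketched in a few lines. The talk's pipeline uses Java MapReduce with Tika; the version below is only a stand-in written in Python, using the standard library's HTML parser in place of Tika. The field names (`id`, `url`, `crawl_date`, `content`) are assumptions, and the `<add><doc>` layout is Solr's XML update message format, which may differ from the files the archive's own jobs produce.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Stand-in for Tika: return the page's visible text as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_solr_xml(url: str, crawl_date: str, html: str) -> str:
    """Build one <add><doc> message; field names are hypothetical."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": f"{crawl_date}/{url}",
        "url": url,
        "crawl_date": crawl_date,
        "content": extract_text(html),
    }
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")
```

In a MapReduce setting, a function like `to_solr_xml` would play the role of the map step, called once per archived record.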
Search indexing process
[Diagram: search indexing architecture. Components: WCT crawlers; (w)arc storage; dedicated meta database; Hadoop cluster (Node 1 to Node 50); XML media, image and document stores; three dedicated Solr indexers; three dedicated Solr search servers; document meta service; web access. Labelled steps: generate (w)arcs; insert meta information; generate XML files; DIH indexes new XML; replication; retrieve (w)arcs and meta information.]
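
The diagram names separate XML stores for media, image and document content. One way such a split could be made is by MIME type, as in the sketch below. The routing rule and the store names are assumptions for illustration; the slides do not say how records are assigned to stores.

```python
def choose_store(mime_type: str) -> str:
    """Pick an XML store from a MIME type (assumed rule, not from the talk)."""
    major = mime_type.split("/", 1)[0].strip().lower()
    if major == "image":
        return "image"
    if major in ("audio", "video"):
        return "media"
    return "document"  # everything else, e.g. text/html or application/pdf


def partition(records):
    """Group (url, mime_type) pairs by their target store."""
    stores = {"media": [], "image": [], "document": []}
    for url, mime_type in records:
        stores[choose_store(mime_type)].append(url)
    return stores
```

Keeping each content type in its own store would let each Solr indexer be sized and scheduled independently, which is one plausible reason for the layout shown.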
Tag cloud analysis – General Election 2005
• Special Collection: 2005 general election
• 147 websites archived during and immediately after the UK general election campaign of 2005
• Tag clouds (or weighted lists) generated for websites belonging to key political parties
• Shows the most frequently used words in the websites during the 2005 election campaign
• Special collection for the 2010 general election now available
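
A weighted list of this kind can be sketched as a word count scaled to font sizes. The stopword list, the size range and the function name below are assumptions for illustration, not the method used to produce the tag clouds in the talk.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "and", "for", "our", "will", "with", "that", "this"})


def tag_cloud(text, top_n=10, min_size=10, max_size=40):
    """Return (word, count, font_size) for the most frequent words.

    Font size is scaled linearly between min_size and max_size
    according to each word's count among the selected words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts match
    return [
        (word, count, round(min_size + (count - lowest) * (max_size - min_size) / span))
        for word, count in top
    ]
```

Running `tag_cloud` over the concatenated text of each party's archived pages would give one weighted list per party, which can then be compared side by side.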
The value of the haystacks – postcode-based access
Map legend (count ranges and colours, paired in the order listed):
1: Blue
2-5: Green
5+: Purple
50+: Yellow
100+: Red
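
Postcode-based access of this kind needs two small pieces: finding postcodes in page text and mapping a count to a legend colour. The sketch below shows both under stated assumptions. The regular expression is a simplified pattern that does not validate every UK postcode rule, the pairing of ranges with colours follows the order in which the legend lists them, and a count of exactly 5 is treated as Green because the legend's "2-5" and "5+" ranges overlap.

```python
import re

# Simplified UK postcode pattern: outward code, optional space, inward code.
POSTCODE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s?(\d[A-Z]{2})\b")


def find_postcodes(text):
    """Return postcodes found in text, normalised to 'OUTWARD INWARD'."""
    return [f"{outward} {inward}" for outward, inward in POSTCODE.findall(text.upper())]


def colour_for(count):
    """Map a count to a legend colour (assumed thresholds, see lead-in)."""
    if count >= 100:
        return "Red"
    if count >= 50:
        return "Yellow"
    if count > 5:
        return "Purple"
    if count >= 2:
        return "Green"
    if count == 1:
        return "Blue"
    return None  # nothing to plot
```

For example, `find_postcodes("The British Library, 96 Euston Road, London NW1 2DB")` picks out the library's own postcode, which could then be counted and coloured on the map.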
Questions?
Thank you.

http://www.webarchive.org.uk

lewis.crawford@bl.uk

@relephantdata