Access and Analytics to the UK Web Archive
Lewis Crawford, Web Archive Technical Lead
The British Library

Introduction
This talk will cover:
• Background of the UK Web Archive
• Traditional access methods to Web Archives
• Full text search for resource discovery
• Problems of scale – needles and haystacks

Web Archiving: the basics
What
• Selecting, capturing, storing, preserving and managing access to snapshots of websites over time
How
• Use crawler software to download websites automatically
• Selective or domain archiving
• Provide access in a Web Archive
When
• Since the mid 1990s
Who
• Heritage and memory organisations (e.g. IIPC)
• University libraries
• Not-for-profit and commercial organisations, e.g. Internet Archive
• Individual researchers
Why
• Global information resource
• Artefact of cultural and technology change
• Representative sample of the web: historical and sociological data that may not be found elsewhere
• Part of national digital heritage - legal requirements

UK Web Archive: Web archive as historical documents

Multimedia based content

3D visualisation wall

Full text search

N-gram visualisation

Media based results

Semantic analysis

Scale: needle and haystack

Subject hierarchy visualisation

UK Web Archive
• ~10,000 websites collected since 2004
• ~40,000 instances
Google
• “seen 1 trillion unique URLs”
• More than a billion new pages are added to the web every day

The UK web domain
• 9 million .uk domain names registered in December 2010
• ~1 million using other domain names
• Growing at 11% - 14% per year
• 40% estimated to be in scope for Legal Deposit
• Estimated ~110 TB for each UK domain crawl

The value of the haystacks – content visualisation

Big Data analytics
• Java Map/Reduce jobs use Tika to extract text and generate XML files for Solr ingest
• Hive & Pig for ad hoc query analysis

Search indexing process
[Diagram: Hadoop (Node 1 to Node 50) generates XML files; for each of the media store, image store and document store, Solr DIH indexes the new XML]
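
As a rough illustration of the "generate XML files for Solr ingest" step on the Big Data analytics slide, the sketch below builds one XML message from text that has already been extracted. It is written in Python rather than as a Java Map/Reduce job with Tika. The slides do not show the XML layout, so the sketch assumes Solr's standard XML update format (`<add><doc><field name="...">`). The function name `solr_add_xml`, the field names and the sample record are all hypothetical.

```python
import xml.etree.ElementTree as ET


def solr_add_xml(docs):
    """Build a Solr XML update message (<add><doc>...</doc></add>) from dicts.

    Assumption: the target index accepts Solr's standard XML update format.
    The layout actually used in the talk is not shown in the slides.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical record: field names and values are illustrative only.
records = [
    {
        "id": "example-1",
        "url": "http://www.example.org.uk/",
        "content": "Text extracted from an archived page",
    },
]
xml_message = solr_add_xml(records)
print(xml_message)
```

In a Map/Reduce setting, each mapper would emit one such message (or one `<doc>` element) per archived resource, which is what lets the XML generation be spread across many nodes.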
[Diagram: WCT crawlers (generate (w)arcs, insert meta information), meta database, document meta service, dedicated indexers (retrieve (w)arcs and meta information), replication to dedicated Solr search servers, web access]

Tag cloud analysis – General Election 2005
• Special Collection: 2005 general election
• 147 websites archived during and immediately after the UK general election campaign of 2005
• Tag clouds (or weighted lists) generated for websites belonging to key political parties
• Shows the most frequently used words in the websites during the 2005 election campaign
• Special Collection for the 2010 general election now available

The value of the haystacks – postcode-based access
[Map legend: 1, 2-5, 5+, 50+, 100+ (Blue, Green, Purple, Yellow, Red)]

Questions?
Thank you.
http://www.webarchive.org.uk
lewis.crawford@bl.uk
@relephantdata
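
A tag cloud (weighted list) of the kind described on the General Election 2005 slide can be sketched as a word count followed by a scaling of counts to font sizes. This is a minimal Python sketch, not the method used for the UK Web Archive. The stop-word list, the sample text, the size range and the function name `tag_cloud` are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "we", "our", "will"}


def tag_cloud(text, top_n=5, min_size=10, max_size=40):
    """Return (word, count, font_size) tuples for the top_n most frequent words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words).most_common(top_n)
    if not counts:
        return []
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = max(highest - lowest, 1)  # avoid division by zero when all counts match
    # Linear scaling: the rarest listed word gets min_size, the most frequent max_size.
    return [
        (word, count, min_size + (count - lowest) * (max_size - min_size) // span)
        for word, count in counts
    ]


# Made-up sample text, standing in for text extracted from a party website.
sample = "Tax and schools. Schools and hospitals. Tax, tax, tax and more schools."
for word, count, size in tag_cloud(sample):
    print(word, count, size)
```

Linear scaling is the simplest choice; logarithmic scaling is a common alternative when a few words dominate the counts.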