How to Face the Challenges of Web Archiving?
The experiences of a small library on the edge.
Chloe Martin, Internet Memory
Catherine Ryan, National Library of Ireland
LIBER 2012 - 1

Context: National Library of Ireland
• Beginnings: Established by the Dublin Science and Art Museum Act, 1877
• Mission: “to collect, preserve, promote and make accessible the documentary and intellectual record of the life of Ireland”.
• The Digital Record: Born Digital Programme established in 2010, covering web archiving.
• Web Archive Projects: 2 pilot projects in 2011

Context: Internet Memory
European Archive / Internet Memory Foundation
• Established in 2004 in Amsterdam (offices also in Paris)
• Mission: to preserve Web content as a new medium for current and future generations
• Actions: awareness-raising, partnerships, R&D
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and the National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: to operate large-scale or selective crawls and develop new technologies (crawl, access, processing and extraction)

Web Archiving Project: Project Origins
National Library of Ireland
Building a 21st Century Library:
– Born Digital
– Digitisation
– Single Integrated Catalogue
– Digital Repository
– OSCAIL, the Digital Library Programme

Web Archiving Project: Project Origins
National Library of Ireland
Born Digital Materials:
• Natural progression for NLI’s strong political, cultural and historical collections
• How best to approach this in a time of unprecedented financial difficulty?
• Born Digital Programme established to examine requirements and produce a policy document for the next steps

Web Archiving Project: Project Origins
National Library of Ireland
The Hand of History:
– Snap General Election
– Five Weeks

Web Archiving Project: Project Origins
National Library of Ireland
Just do it
How?

Web Archiving Project: Project Origins
National Library of Ireland
Collaborative Partnership:
Partner that suited our requirements and that had experience with others in the cultural sector
Requirements:
– Technical skills existed in the NLI but were committed to other projects, so we needed these skills from a partner
– Leverage NLI’s strong curatorial experience, esp. in politics
– Fast!

Web Archiving Project: Project Origins
National Library of Ireland
Project phases:
– Project scoping and contract
– Site selection
– Permissions gathering
– QA (look and feel)
– Publication and promotion

Site Selection and Permissions
National Library of Ireland
Selection Criteria:
– Website presence
– Technical reasons
– Cut-off date
– Women candidates
Permissions:
– All sites contacted and provided with a brief
– Pressurised but necessary phase

Scope of projects
National Library of Ireland
General Election:
– Crawl: 200 snapshots
– Scope: 100 seeds
– Frequency: 2 times
– Date: Feb. 2011
Presidential Election:
– Crawl: 80 snapshots
– Scope: 70 seeds
– Frequency: 3 times
– Date: Oct-Nov. 2011

Crawl
Internet Memory
• Seeds Validation: URLs, duplication, redirection, external links, dynamic websites
• Scope Parameters: domain, host and path; Social Web content; frequency; robots.txt file exclusions; politeness
• Specific incidents, technical changes on the fly: modification of scope; pending crawls; adaptation of the politeness
• Improvement of second crawl

Quality Assurance (QA)
National Library of Ireland
• Manual QA
• Jira software
• IM: Technical QA
• NLI: ‘Look and Feel’ QA
• Multiple browsers
• Communication with site owners (building relationships and promotion)

Quality Assurance (QA)
Internet Memory
• Why?
• How?
• Manual and visual method: homepage + 2
• Resolution of issues
• Temporal Coherence

Access
National Library of Ireland
• Available to the public
• Full text search
• IM website: search by keyword, URL
• NLI catalogue: keyword search via widget developed by NLI IS team and IM
• Future: access through NLI’s own interfaces; issue of integrating results

Publication and Promotion
National Library of Ireland
• NLI social media initiative (Twitter and blog)
• Project participants
• Print media (esp. in area of technology)
• And IM!
• Usage figures have increased, but the real value will be more apparent in 5-10 years

Usage Statistics of Web Archive
National Library of Ireland
[Chart: unique visitors per month, y-axis from 0 to 1,000, annotated with the events below]
21/09/2011: Official launch of NLI Web archives (Tweets)
26/10/2011: Blog post on nli.ie/blog and article in thejournal.ie
25/11/2011: Article on irishtimes.com
20/01/2012: Article on irishtimes.com
17/03/2012: Post on soundofthearchives.wordpress.com
04/05/2012: Article on irisheconomy.ie

Advantages of Web Archiving
National Library of Ireland
Web archiving:
– New opportunities for delivery of materials to users
– Work with existing users’ expectations that content be online
– Reach new audiences

Advantages of Web Archiving
National Library of Ireland
Political web archives; Irish General Election:
– Researchers can compare online content pre- and post-election
– Facilitates research into how ‘online’ this election was
– Assess impact of technological developments in campaign communications
– Record of campaign information

Benefits of Working Together
National Library of Ireland
Pilot project for a long-term activity:
– Allowed us to enter a new collecting area despite lack of tech expertise
– Facilitated collection of important material that no one else was collecting
– Collected material quickly
– Leveraged curatorial skills
– Gained new technical skills

Benefits of Working Together
Internet Memory
• To support the development of Web archiving initiatives
• To operate rapid deployment of Web archives
• To address new challenges in this area:
– Social media content
– QA
– Automation

Conclusion
General Election:
• 18,495,771 URLs
• 1.14 TB
• 10,405 ARCs
Presidential Election:
• 7,333,399 URLs
• 278.10 GB
• 2,513 ARCs
View the NLI collections at: http://www.nli.ie/en/udlist/digitalcollections.aspx
View the Web archive blog entry at: http://www.nli.ie/blog/index.php/2011/10/26/general-election-2011-webarchiving/
View Internet Memory Collections at: http://collections.europarchive.org/
To be continued…

Questions?
Thanks for your attention!
Catherine Ryan, National Library of Ireland
http://www.nli.ie
cryan@nli.ie
@NLIreland
Chloe Martin, Internet Memory
http://internetmemory.org
chloe@internetmemory.net
@InternetMemory
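
Illustration: seed validation and scope
The Crawl slide lists seed validation (URLs, duplication) and scope parameters (domain, host and path). A minimal Python sketch of what such checks could look like follows. It is not Internet Memory's crawler code: the function names, the three scope levels as implemented here, and the example.org URLs are assumptions made for illustration, and seeds are assumed to be absolute http(s) URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_seed(url: str) -> str:
    """Lower-case the network location, drop the fragment, ensure a path."""
    parts = urlsplit(url.strip())
    if not parts.scheme or not parts.netloc:
        # Assumption: every seed is an absolute URL such as http://host/path
        raise ValueError(f"not an absolute URL: {url!r}")
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(seeds):
    """Keep the first occurrence of each normalised seed, preserving order."""
    seen = set()
    unique = []
    for seed in seeds:
        key = normalise_seed(seed)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def in_scope(url: str, seed: str, level: str = "host") -> bool:
    """Decide whether url falls inside the scope defined by seed.

    level is "domain" (the seed's host and its subdomains), "host"
    (exactly the seed's host) or "path" (same host, at or under the
    seed's path). These levels are an illustrative reading of the slide.
    """
    u = urlsplit(normalise_seed(url))
    s = urlsplit(normalise_seed(seed))
    if level == "domain":
        base = s.hostname[4:] if s.hostname.startswith("www.") else s.hostname
        return u.hostname == base or u.hostname.endswith("." + base)
    if level == "host":
        return u.hostname == s.hostname
    if level == "path":
        prefix = s.path if s.path.endswith("/") else s.path + "/"
        return u.hostname == s.hostname and (u.path == s.path or u.path.startswith(prefix))
    raise ValueError(f"unknown scope level: {level}")


if __name__ == "__main__":
    seeds = [
        "http://www.example.org",
        "HTTP://www.example.org/#top",
        "http://www.example.org/news",
    ]
    print(deduplicate(seeds))
    print(in_scope("http://blog.example.org/post", "http://www.example.org/", "domain"))
```

Redirection, robots.txt exclusions and politeness delays are deliberately left out of this sketch; a production crawler handles them at fetch time rather than during seed-list preparation.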