How to Face the Challenges
of Web Archiving?
The experiences of a small library on the edge.
Chloe Martin, Internet Memory
Catherine Ryan, National Library of Ireland
LIBER 2012 - 1
Context:
National Library of Ireland
• Beginnings: Established by the Dublin Science and Art Museum
Act, 1877
• Mission: “to collect, preserve, promote and make accessible
the documentary and intellectual record of the life of Ireland”.
• The Digital Record: Born Digital Programme established in
2010, covering web archiving.
• Web Archive Projects: 2 pilot projects in 2011
Context:
Internet Memory
European Archive / Internet Memory Foundation
• Established in 2004 in Amsterdam (offices also in Paris)
• Mission: to preserve Web content as a new medium for current and
future generations
• Actions: awareness-raising, partnerships, R&D
• Open Access Collections: UK National Archives & Parliament,
PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: to operate large scale or selective crawls & develop new
technologies (crawl, access, processing and extraction)
Web Archiving Project: Project Origins
National Library of Ireland
Building a 21st Century Library:
– Born Digital
– Digitisation
– Single Integrated Catalogue
– Digital Repository
– OSCAIL, the Digital Library Programme
Web Archiving Project: Project Origins
National Library of Ireland
Born Digital Materials:
• Natural progression for NLI’s strong political,
cultural and historical collections
• How best to approach this in a time of
unprecedented financial difficulty?
• Born Digital Programme established to examine
requirements and produce a policy document for
the next steps
Web Archiving Project: Project Origins
National Library of Ireland
The Hand of History:
– Snap General Election
– Five Weeks
Web Archiving Project: Project Origins
National Library of Ireland
Just do it
How?
Web Archiving Project: Project Origins
National Library of Ireland
Collaborative Partnership:
Partner that suited our requirements and that had experience with
others in the cultural sector
Requirements:
– Technical skills existed in the NLI but were committed to other
projects; we needed these skills
– Leverage NLI’s own strong curatorial experience, esp. in politics
– Fast!
Web Archiving Project: Project Origins
National Library of Ireland
Project phases:
– Project scoping and contract
– Site selection
– Permissions gathering
– QA (look and feel)
– Publication and promotion
Site Selection and Permissions
National Library of Ireland
Selection Criteria:
– Website presence
– Technical reasons
– Cut-off date
– Women candidates
Permissions:
– All sites contacted and provided with a brief
– Pressurised but necessary phase
Scope of projects
National Library of Ireland
General Election:
– Crawl: 200 snapshots
– Scope: 100 seeds
– Frequency: 2 times
– Date: Feb. 2011
Presidential Election:
– Crawl: 80 snapshots
– Scope: 70 seeds
– Frequency: 3 times
– Date: Oct-Nov. 2011
Crawl
Internet Memory
• Seeds Validation:
URLs, Duplication, Redirection, External links, Dynamic websites
• Scope Parameters:
Domain, host and path; Social Web content; Frequency; Robots.txt
file exclusion; Politeness
• Specific incidents → technical changes on the fly
Modification of scope; Pending crawls; Adaptation of the politeness
• Improvement of second crawl
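The seed validation and scope parameters listed above can be illustrated with a minimal Python sketch. It is not Internet Memory's crawler code: the function names, the `example.org` hosts, the `examplebot` user agent and the naive "strip a leading www." rule for the domain level are all assumptions made for illustration. It uses only the standard library and does not touch the network, so redirection checks are left out.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_seed(url):
    """Lower-case scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a crawlable URL: {url!r}")
    netloc = parts.hostname  # urlsplit already lower-cases the hostname
    if parts.port:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))


def validate_seeds(urls):
    """Return (valid seeds without duplicates, rejected inputs), keeping input order."""
    seen, valid, rejected = set(), [], []
    for url in urls:
        try:
            seed = normalize_seed(url)
        except ValueError:
            rejected.append(url)
            continue
        if seed not in seen:
            seen.add(seed)
            valid.append(seed)
    return valid, rejected


def in_scope(url, seed, level="host"):
    """Decide whether a discovered URL belongs to the crawl of `seed`.

    level is 'domain' (seed host and its subdomains), 'host' (exact host)
    or 'path' (exact host and a path under the seed path).
    """
    u, s = urlsplit(url), urlsplit(seed)
    uh, sh = (u.hostname or ""), (s.hostname or "")
    if level == "domain":
        # Simplification (assumption): treat 'www.' as the only prefix to strip.
        base = sh[4:] if sh.startswith("www.") else sh
        return uh == base or uh.endswith("." + base)
    if level == "host":
        return uh == sh
    if level == "path":
        return uh == sh and u.path.startswith(s.path)
    raise ValueError(f"unknown scope level: {level}")


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `validate_seeds(["HTTP://Example.org", "http://example.org/#top"])` collapses both inputs into the single seed `http://example.org/`, which is the kind of duplication a seed list gathered by hand tends to contain. Politeness (the delay between requests to one host) would sit in the fetch loop and is not shown here.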
Quality Assurance (QA)
National Library of Ireland
• Manual QA
• Jira software
• IM – Technical QA
• NLI – ‘Look and Feel’ QA
• Multiple browsers
• Communication with site owners (building relationships and promotion)
Quality Assurance (QA)
Internet Memory
• Why?
• How?
• Manual and visual method: homepage + 2
• Resolution of issues
• Temporal Coherence
Access
National Library of Ireland
• Available to the public
• Full text search
• IM website – search by keyword, URL
• NLI catalogue – keyword via widget developed by NLI IS team and IM
• Future – access through NLI’s own interfaces, issue of integrating results
Publication and Promotion
National Library of Ireland
• NLI social media initiative (Twitter and
blog)
• Project participants
• Print media (esp. in area of technology)
• And IM!
• Usage figures have increased but real
value more apparent in 5-10 years
Usage Statistics of Web Archive
National Library of Ireland
Unique visitors per month (chart; y-axis from 0 to 1,000), with annotated events:
21/09/2011: Official launch of NLI Web archives (Tweets)
26/10/2011: Blog post on nli.ie/blog and article on thejournal.ie
25/11/2011: Article on irishtimes.com
20/01/2012: Article on irishtimes.com
17/03/2012: Post on soundofthearchives.wordpress.com
04/05/2012: Article on irisheconomy.ie
Advantages of Web Archiving
National Library of Ireland
Web archiving:
– New opportunities for delivery of materials to
users
– Work with existing users’ expectations that
content be online
– Reach new audiences
Advantages of Web Archiving
National Library of Ireland
Political web archives; Irish General Election:
– Researchers can compare online content pre- and post-election
– Facilitates research into how ‘online’ this
election was
– Assess impact of technological developments
in campaign communications
– Record of campaign information
Benefits of Working Together
National Library of Ireland
Pilot project for a long-term activity:
– Allowed us to enter a new collecting area
despite lack of tech expertise
– Facilitated collection of important material that
no one else was collecting
– Collect material quickly
– Leverage curatorial skills
– Gained new technical skills
Benefits of Working Together
Internet Memory
• To support the development of Web
archiving initiatives
• To operate rapid deployment of Web
archives
• To address new challenges in this area:
• Social media content
• QA
• Automation
Conclusion
General Election:
• 18,495,771 URLs
• 1.14 TB
• 10,405 ARCs
Presidential Election:
• 7,333,399 URLs
• 278.10 GB
• 2,513 ARCs
View the NLI collections at:
http://www.nli.ie/en/udlist/digitalcollections.aspx
View the Web archive blog entry at:
http://www.nli.ie/blog/index.php/2011/10/26/general-election-2011-webarchiving/
View Internet Memory Collections at:
http://collections.europarchive.org/
To be continued…
Questions?
Thanks for your attention!
Catherine Ryan
National Library of Ireland
http://www.nli.ie
cryan@nli.ie
@NLIreland
Chloe Martin
Internet Memory
http://internetmemory.org
chloe@internetmemory.net
@InternetMemory