Preserving Web-based digital material

Andrea Goethals
Harvard University Library
Why Books? Site Visit
28 October 2010
Agenda
1. Why preserve Web content?
2. A look at the Web
3. Web archiving
4. Web archiving at Harvard
5. Open challenges in Web archiving
6. Questions?
1. Why preserve Web content?
Books have moved off the shelves and onto the Web!
A few other things on the Web…
TV shows, blogs, images, scholarly papers, stores, discussions, maps, virtual worlds, art exhibits, documents, music, articles, magazines, newspapers, tutorials, software, databases, social networking, advertising, courses, museums, libraries, archives, recipes, data sets, oral history, poetry, broadcasts, wikis, movies…
But is it valuable?
May be historically significant
White House web site, March 20, 2003
May be the only version
Harvard Magazine May/June 2009
May document human behavior
World of Warcraft, Fizzcrank realm, Morc the Orc’s view, Oct. 25 2010
Important to researchers
ABC News Aug. 2007
Important to researchers
• Strangers and friends: Collaborative play in World of Warcraft
• From tree house to barracks: The social life of guilds in World of Warcraft
• The life and death of online gaming communities: A look at guilds in World of Warcraft
• Learning conversations in World of Warcraft
• The ideal elf: Identity exploration in World of Warcraft
• Traffic analysis and modeling for World of Warcraft
• E-collaboration and e-commerce in virtual worlds: The potential of Second Life and World of Warcraft
• Understanding social interaction in World of Warcraft
• Communication, coordination, and camaraderie in World of Warcraft
• An online community as a new tribalism: The World of Warcraft
• A hybrid cultural ecology: World of Warcraft in China
• … etc.
May be a work of art
YouTube Play. A Biennial of Creative Video (Oct. 2010 -)
May be important data for scholarship
NOAA Satellite and Information Service
May be an important reference
May be of personal value
2. A look at the Web
Remember this?
1993: “First” graphical Web browser (Mosaic)
Volume of content is immense!
• 1998: First Google index has 26 million pages
• 2000: Google index has 1 billion pages
• 2008: Google processes 1 trillion unique URLs
• “… and the number of individual Web pages out there is growing by several billion pages per day”
(Source: the official Google blog)
Prolific self-publishers
“Humanity’s total digital output currently stands at 8,000,000 petabytes … but is expected to pass 1.2 zettabytes this year. One zettabyte is equal to one million terabytes…”
“Around 70 per cent of the world’s digital content is generated by individuals, but it is stored by companies on content-sharing websites such as Flickr and YouTube.”
Telegraph.co.uk May 2010 on IDC study
Ever-increasing # of web sites
96 million out of 233 million web sites are active (Netcraft.com)
A moving target
• Flickr (Feb 2004)
• Facebook (Feb 2004)
• YouTube (Feb 2005)
• Twitter (2006)
Anatomy of a web page
Typically, 1 web page ≈ 35 files:
• 1 HTML file
• 7 text/css
• 8 image/gif
• 17 image/jpeg
• 2 javascript
Source: representative samples taken by Internet Archive
Can’t rely on it always being out there
Web content is transient
The average lifespan of a web site is between 44 and 100 days
Captured April 8, 2009
Visited October 13, 2010
Disappearing web sites
• 2000 Sydney Olympics: most of the Web record is only held by the National Library of Australia
• Half of the URLs cited in D-Lib Magazine were inaccessible 10 years after publication (McCown et al., 2005)
3. Web archiving
Web archiving 101
1. Web harvesting: select and capture it
   (acquisition of web content, alongside acquisition of other digital content)
2. Preservation of captured Web content: “digital preservation”
   • Keep it safe
   • Keep it usable to people long-term, despite technological changes
   (preservation of web content, alongside preservation of other digital content)
Web harvesting
• Download all files needed to reproduce the Web page
• Try to capture the original form of the Web page as it would have been experienced at the time of capture
• Also collect information about the capture process
• Must be some kind of selection…
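The first bullet, downloading every file needed to reproduce a page, starts with discovering which files the HTML embeds. A minimal sketch of that discovery step using only Python's standard library; the class and function names here are illustrative, not part of any WAX or IIPC tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceExtractor(HTMLParser):
    """Collects URLs of files needed to render a page: images, stylesheets, scripts."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "script" and "src" in attrs:
            self.resources.append(urljoin(self.base_url, attrs["src"]))

def embedded_resources(html, base_url):
    """Return absolute URLs of the resources a page embeds, in document order."""
    parser = ResourceExtractor(base_url)
    parser.feed(html)
    return parser.resources
```

A real harvester would then fetch each of these URLs and record the responses alongside the HTML, so the page can later be replayed as it looked at capture time.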
Type of harvesting
• Domain harvesting: collect the web space of an entire country
  (e.g. the French Web, including the .fr domain)
• Selective harvesting: collect based on a theme, event, individual, organization, etc.
  (e.g. the London 2012 Olympics, Hurricane Katrina, women’s blogs, President Obama)
Any type of regular harvesting results in a large quantity of content to manage.
The crawl
[Diagram: the crawl cycle]
Pick a location (seed URIs) → make a request to the Web server → receive the response from the Web server (document exchange) → examine the document for URI references → repeat
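The cycle above can be sketched as a loop over a frontier of URIs. This is a hypothetical, minimal illustration, not production crawler code: the `fetch` callable is injected so the loop can run without touching the live Web, and real crawlers such as Heritrix add politeness delays, robots.txt handling, and crawl scoping:

```python
import re
from collections import deque
from urllib.parse import urljoin

# Crude link discovery for the sketch; real crawlers parse HTML properly.
LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(seeds, fetch, limit=100):
    """Breadth-first crawl: request each URI, store the response,
    and queue any newly discovered URI references."""
    frontier = deque(seeds)   # URIs waiting to be requested
    seen = set(seeds)
    archive = {}              # URI -> captured document
    while frontier and len(archive) < limit:
        uri = frontier.popleft()
        body = fetch(uri)             # make a request to the Web server
        archive[uri] = body           # receive and store the response
        for ref in LINK_RE.findall(body):   # examine for URI references
            absolute = urljoin(uri, ref)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return archive
```

For example, feeding it a two-page fake site (a dict of URI → HTML and a `fetch` that looks URIs up in it) captures both pages and stops once no unseen references remain.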
Web archiving pioneers: mid-1990s
Internet Archive, Alexa Internet, and the national libraries of Sweden, Denmark, Australia, Finland and Norway (with collecting partners)
Adapted from A. Potter’s presentation, IIPC GA 2010
International Internet Preservation Consortium (IIPC): 2003
Internet Archive, Library and Archives Canada, the British Library, the Library of Congress, and the national libraries of Sweden, Denmark, France, Norway, Italy, Finland, Australia and Iceland
IIPC: http://netpreserve.org
IIPC goals
• Facilitate preservation of a rich body of Internet content from around the world
• Develop common tools, techniques and standards
• Encourage and support Internet archiving and preservation
IIPC: http://netpreserve.org
IIPC: 2010
Internet Archive, Library of Congress, British Library / UK, WAC (UK), TNA (UK), Library and Archives Canada, BAnQ (Canada), Harvard, GPO (US), UNT (US), NYU (US), CDL (US), UIUC (US), AZ AI Lab (US), OCLC, Hanzo Archives, European Archive, and the national libraries of Israel, Singapore, Japan, Korea, Scotland, Denmark, Iceland, Finland, Australia, Croatia, Norway, New Zealand, Austria, Spain / Catalunya, France / INA, Sweden, the Netherlands, Poland, Germany, Slovenia, Italy, Switzerland and the Czech Republic (several with Archive-It or collecting partners)
Adapted from A. Potter’s presentation, IIPC GA 2010
Current methods of harvesting
• Contract with another party for crawls
  (e.g. Internet Archive’s crawls for the Library of Congress)
• Use a hosted service
  (e.g. Internet Archive’s Archive-It; California Digital Library’s Web Archiving Service (WAS))
• Set up an institution-specific web archiving system
  (e.g. Harvard’s Web Archiving Collection Service (WAX))
Most use IIPC tools like the Heritrix web crawler
Current methods of access
• Currently dark – no access (e.g. Norway)
• Only on-site to researchers (e.g. BnF, Finland)
• Public on-line access (e.g. Harvard, LAC)
What kind of access?
• Most common: browse as it was
• Sometimes: full-text search
• Very rare: bulk access for research
• Non-existent: cross-web-archive access
http://netpreserve.org/about/archiveList.php
4. Web archiving at Harvard
Web Archiving Collection Service (WAX)
• Used by “curators” within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content
• Content selection is a local choice
• The content is publicly available to current and future users
WAX workflow
• A Harvard unit sets up an account (one-time event)
• On an on-going basis:
  • Curators within that unit specify and schedule content to crawl
  • WAX crawlers capture the content
  • Curators QA the Web harvests
  • Curators organize the Web harvests into collections
  • Curators make the collections discoverable
  • Curators push content to the DRS – it becomes publicly viewable and searchable
[Diagram: WAX architecture. Front end: WAX curator, WAXi curator interface, WAX temp storage, temp index, back-end services. Back end: DRS (preservation repository), production index, WAX public interface, HOLLIS catalog, archive user.]
Back-end services
• WAX crawlers
• File Movers
• Importer
• Deleter
• Archiver
• Indexers
Catalog record
• Minimally at the collection level
• Sometimes also at the Web site level
http://wax.lib.harvard.edu
5. Open challenges in Web archiving
How do we capture…?
• Streaming media (e.g. videos)
  • Non-HTTP protocols (RTMP, etc.), sometimes proprietary
  • Experiments to capture video content in parallel to regular crawls (e.g. BL’s One & Other project)
  • Complicates play-back as well
  • Still experimental, non-scalable and time-consuming
How do we capture…?
• Highly interactive sites (Flash, AJAX)
  • Experiments to launch Web browsers that can simulate Web clicks (INA, European Archive)
  • Still experimental and time-consuming
• “Walled gardens”
  • Need help from content hosts
What’s next? The Web keeps changing
How do we do…?
• Quality Assurance (QA)
  • Too time-consuming to manually check everything
  • Early experiments with automated QA in combination with some manual QA (IA, BL)
How do we provide access in the future given its…?
• Complex rendering requirements
  • Many formats – dependent on different players and plug-ins
• Potential solutions (IIPC PWG)
  • Prioritize format work and tools based on annual format surveys
  • Experiments with migration and emulation
  • Format, browser and software knowledge bases
How do we preserve it given its…?
• Separate body of digital content to preserve
  • Different infrastructure and staff
  • Not actively preserved (the status quo for Web archives)
• Potential solutions:
  • Integrate with digital preservation repositories (e.g. Harvard, NLNZ) – leverages existing infrastructure, processes, staff
  • Keep separate but consult digital preservation staff
Why is this important?
• “Online video is the fastest growing creative realm on the Internet…”
• “… puts the Guggenheim and YouTube at the forefront of technology and creativity.”
• “The Internet is changing the creation and distribution of digital media.”
• “With the democratization of production tools and the ability to create works that reach and are shared by millions of people, online video deserves the kind of critical focus this project will bring to bear.”
Excerpts from the YouTube Play FAQ
How do we capture…?
• Good representations of Web sites
  • Time-consuming and expensive to capture every variation of a site
    • Potential solution: scheduled snapshots based on knowledge of the meaningful change rate (e.g. Harvard University Archives’ Fall and Spring crawls)
  • Temporal coherence – the site changes while it’s being crawled
Who’s responsible?
• Inability to determine who is responsible for preserving which Web content
  • Larger problem for collectively-produced content
• Potential strategies
  • Collaborative collections (e.g. Hurricane Katrina, London Olympics) – different roles based on institutional expertise (seed selection, crawling, storage/preservation)
  • Partner with major content providers (LC: Twitter archive)
How do we eliminate…?
• Web spam
  • Intentional and unintentional crawler traps
  • Potential solution: spam filters during or after a crawl
• Duplicate content
  • Exact copies of content previously captured
  • Within a harvest – Heritrix already de-dupes
  • Among harvests – a “smart crawler” version of Heritrix exists
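One common approach, and roughly the idea behind Heritrix's de-duplication, is digest-based: hash each captured payload and store the bytes only the first time that digest appears; later captures record just a pointer to the earlier copy (a "revisit" record in WARC terms). A minimal sketch with illustrative names, not actual Heritrix code:

```python
import hashlib

class DedupStore:
    """Digest-based de-duplication: identical payloads are stored once."""
    def __init__(self):
        self.payloads = {}   # digest -> bytes, stored only once
        self.records = []    # (uri, digest, is_duplicate) capture log

    def add(self, uri, payload):
        """Record a capture; return True if the payload was already stored."""
        digest = hashlib.sha1(payload).hexdigest()
        duplicate = digest in self.payloads
        if not duplicate:
            self.payloads[digest] = payload
        self.records.append((uri, digest, duplicate))
        return duplicate
```

The same mechanism works within a single harvest (mirrored files) and across harvests (an unchanged page re-crawled months later), which is where the storage savings are largest.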
What should we collect?
• Inability to determine now what will be valuable in the future
• Potential strategies
  • Only do large domain crawls (Internet Archive, Swedish National Library, Library and Archives Canada) – but there’s a price to pay for these crawls!
  • Selective crawls complemented with periodic broad domain crawls (e.g. BnF, Denmark)
How do we describe it given its…?
• Volume
  • Prohibits technical metadata description and storage
  • But technical metadata is necessary to know what you have and to plan its preservation
  • Limited amounts of metadata (Harvard – formats, admin flags)
6. Questions?