Preserving Web Content:
Harvard & Worldwide
Andrea Goethals | andrea_goethals@harvard.edu | June 23, 2011
Agenda
PART 1: The Web
PART 2: Web Archiving Today
PART 3: Web Archiving at Harvard
PART 4: New & Noteworthy
The Web
1993: “1st” graphical Web browser, Mosaic
[Image: screenshot of the NCSA Mosaic browser]
| UIUC NCSA ftp://ftp.ncsa.uiuc.edu/Web/Mosaic/
“We knew the web was big…”
• 1998: 1st Google index
– 26 million pages
• 2000: Google index
– 1 billion pages
• 2008: Google link processors
– 1 trillion unique URIs
– “… and the number of individual Web pages out there is growing by several billion pages per day” – from the official Google blog
| 7/25/2008, Official Google blog <http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html>
“Are You Ready?”
• 2009: estimated at 0.8 Zettabytes (1 ZB = 1 billion Terabytes)
• 2010: estimated at 1.2 ZB
• 2020: estimated to grow by a factor of 44 from 2009, to 35 ZB
| 2010 Digital Universe Study, IDC <http://gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf>
| The Telegraph, May 4, 2010 <http://www.telegraph.co.uk/technology/news/7675214/Digital-universe-to-smash-zettabyte-barrier-for-first-time.html>
Outpacing storage
[Chart: information created is growing faster than available storage]
| 2010 Digital Universe Study, IDC <http://gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf>
Some of this content is worth keeping
• Citizen journalism
• Hobby sites
• Religious sites
• News sites
• Courses
• Sports sites
• Personal and professional blogs
• Social networks
• Social causes
• Networked games
• History of major events
• References
• Humor
• Podcasts
• Scholarly papers
• Maps
• TV shows
• Documents
• Music
• Discussion groups
• Poetry
• Personal sites
• Magazines
• Books
• Multimedia Art
• Organizational websites
• Photography
• Citizens groups
• Performances
• Research results
• Scientific knowledge
• Political websites
• Government websites
• Internet history
• Comics
• Virtual worlds
• Alternative magazines
• Fashion
• Movies
• Art exhibits
Many believe Web content is permanent
| “Digital Natives Explore Digital Preservation”, Library of Congress <http://www.youtube.com/watch?v=6fhu7s0AfmM>
Ever seen this?
404 Not Found
Yahoo! Geocities (1994-2009)
Web 2.0 Companies: 2006
| Ludwig Gatzke, posted to flickr on Feb. 19, 2006 <http://www.flickr.com/photos/stabilo-boss/93136022/in/set-72057594060779001/>
Web 2.0 Companies: 2009
| Meg Pickard, posted to flickr on May 16, 2009 <http://www.flickr.com/photos/meg/3528372602/>
Mosaic Hotlist (1995)
• 59 URIs
– 50 HTTP URIs
• 36 Dead
• 13 Live
• 1 Reused
– 9 Gopher URIs
• All Dead
A fleeting document of our time period
“Every site on the net has its own unique characteristic and if we forget about them then no one in the future will have any idea about what we did and what was important to us in the 21st century.”
| “America’s Young Archivists”, Library of Congress <http://www.youtube.com/watch?v=Gob7cjzoX3Y>
Web archiving today
Web archiving 101
• Web harvesting
– Select and capture it
• Preservation of captured Web content
– “Digital preservation”
• Keep it safe
• Keep it usable to people long-term, despite technological changes
[Diagram: acquisition of web content shown alongside acquisition of other digital content, and preservation of web content alongside preservation of other digital content]
Anatomy of a Web page
Typically 1 web page = ~35 files:
• 17 JPEG images
• 8 GIF images
• 7 CSS files
• 2 Javascript files
• 1 HTML file
(Source: representative samples taken by the Internet Archive)
www.harvard.edu (6/8/2011): 58 files:
• 19 PNG images
• 13 Javascript files
• 12 GIF images
• 10 JPEG images
• 3 CSS files
• 1 HTML file
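Counts like these are easy to approximate yourself. Below is a minimal sketch in Python (the URL is just an example; it only sees resources referenced in the static HTML, so files loaded by scripts are missed, and the extension-based tally is rough):

```python
# Minimal sketch: count the files referenced by one Web page.
# Only counts resources visible in the static HTML (img/script/css);
# content loaded dynamically by JavaScript will be missed.
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen

class ResourceCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter({"HTML": 1})  # the page itself

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            # crude: take the file extension (query strings will add noise)
            ext = attrs["src"].rsplit(".", 1)[-1].lower()
            self.counts[ext.upper() + " image"] += 1
        elif tag == "script" and "src" in attrs:
            self.counts["Javascript"] += 1
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.counts["CSS"] += 1

parser = ResourceCounter()
with urlopen("http://www.harvard.edu/") as response:  # example URL
    parser.feed(response.read().decode("utf-8", errors="replace"))
for kind, n in parser.counts.most_common():
    print(n, kind)
```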
Web harvesting
• Download all files needed to reproduce the Web page
• Try to capture the original form of the Web page as it would have been experienced at the time of capture
• Also collect information about the capture process
• There must be some kind of content selection…
Types of harvesting
• Domain harvesting
– Collect the web space of an entire country
• The French Web including the .fr domain
• Selective harvesting
– Collect based on a theme, event, individual, organization, etc.
• The London 2012 Olympics
• Hurricane Katrina
• Women’s blogs
• President Obama
– Planned vs. event-based
Any type of regular harvesting results in a large quantity of content to manage.
The crawl
1. Pick a location (seed URIs)
2. Make a request to the Web server
3. Receive the response from the Web server (document exchange)
4. Examine the document for URI references
5. Repeat for the newly found URIs
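A minimal sketch of this loop in Python may make it concrete (illustrative only: it omits robots.txt handling, politeness delays, scoping rules, and WARC output, all of which a production crawler such as Heritrix provides):

```python
# Minimal sketch of the crawl loop: fetch a URI, save the response,
# extract new URI references, and add unseen ones to the frontier.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def store(uri, body):
    # stand-in for real archival storage (e.g. WARC writing)
    print(f"captured {uri} ({len(body)} bytes)")

def crawl(seed_uris, limit=50):
    frontier = deque(seed_uris)   # URIs waiting to be fetched
    seen = set(seed_uris)
    while frontier and limit > 0:
        uri = frontier.popleft()
        limit -= 1
        try:
            with urlopen(uri, timeout=10) as resp:  # request / response
                body = resp.read()
        except OSError:
            continue                                # dead link: skip it
        store(uri, body)
        # examine the document for URI references (crude regex extraction)
        for href in re.findall(rb'href="([^"]+)"', body):
            link = urljoin(uri, href.decode("ascii", errors="ignore"))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)

crawl(["http://www.harvard.edu/"])  # example seed
```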
Web archiving pioneers: mid-1990s
• Internet Archive / Alexa Internet (with collecting partners)
• NL of Sweden
• NL of Denmark
• NL of Australia
• NL of Finland
• NL of Norway
| Adapted from A. Potter’s presentation, IIPC GA 2010
International Internet Preservation Consortium (IIPC): 2003
Members in 2003:
• Internet Archive (with collecting partners)
• L and A Canada
• British Library
• Library of Congress
• NL of Sweden
• NL of Denmark
• NL of France
• NL of Norway
• NL of Italy
• NL of Finland
• NL of Australia
• NL of Iceland
| IIPC <http://netpreserve.org>
IIPC goals
• Enable collection, preservation and long-term access to a rich body of Internet content from around the world
• Foster development and use of common tools, techniques and standards
• Be a strong advocate for initiatives and legislation that encourage the collection, preservation and long-term access to Internet content
• Encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation
| IIPC <http://netpreserve.org>
IIPC: 2011
Members in 2011 (several bring collecting partners; the Internet Archive also brings its Archive-It partners):
• National libraries: Australia, Austria, China, Croatia, Czech Republic, Finland, France / INA, Germany, Iceland, Israel, Italy, Japan, Korea, Netherlands / VKS, NZ, Norway, Poland, Scotland, Singapore, Slovenia, Spain / Catalunya, Sweden, Switzerland; Denmark
• UK and Canada: British Library / UK, WAC (UK), TNA (UK), BANQ Canada, L and A Canada
• US: Harvard Library, UNT, NYU, Library of Congress, CDL, OCLC, AZ AI Lab, UIUC, GPO
• Others: Internet Archive, Hanzo Archives, Internet Memory Foundation, Bib. Alexandrina
Current methods of harvesting
• Contract with other parties for crawls
– Internet Archive’s crawls for the Library of Congress
• Use a hosted service
– Archive-It (provided by the Internet Archive)
– Web Archiving Service (WAS) (provided by California
Digital Library)
• Set up institutional web archiving systems
– Harvard’s Web Archiving Collection Service (WAX)
– Most use IIPC tools like the Heritrix web crawler
Current methods of access
• Currently dark – no access (Norway, Slovenia)
• Only on-site (BnF, Finland, Austria)
• Public online access (Harvard, LAC, some LC collections)
• What kind of access?
– Most common: browse as it was & URL search
– Sometimes: also full text search
– Very rare: bulk access for research
– Nonexistent: cross-institutional web archive discovery/access
Current big challenges
• Legal
– High value content locked up in gated communities (Facebook);
who owns what?
• Technical
– The Web keeps morphing; so must our capture tools
– Big data requires very scalable infrastructure (indexing, de-duplication, format identification, …)
• Organizational
– Web archiving is very resource intensive and competes with
other institutional priorities
• Preservation
– Many different formats; complex interconnected content;
high-maintenance rendering requirements
Web archiving at Harvard
Web Archiving Collection Service (WAX)
• Used by “curators” within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content
• Content selection is a local choice
• The system is managed centrally by OIS
• The content is publicly available to current and future users
• The content is preserved in the Digital Repository Service (DRS) managed by OIS
WAX workflow
• A Harvard unit sets up an account (one-time event)
• On an on-going basis:
– Curators within that unit specify and schedule content to crawl
– WAX crawlers capture the content
– Curators QA the resulting Web harvests
– Curators organize the Web harvests into collections
– Curators make the collections discoverable
– Curators push content to the DRS, where it becomes publicly viewable and searchable
[Architecture diagram: on the back end, the WAX curator works through the WAXi curator interface, which uses WAX temp storage, a temp index, and back-end services, and pushes content to the DRS (preservation repository). On the front end, the archive user reaches content through the HOLLIS catalog, the production index, and the WAX public interface.]
Back-end services
• WAX crawlers
• File Movers
• Importer
• Deleter
• Archiver
• Indexers
Catalog record
• Minimally at the collection level
• Sometimes also at the Web site level
http://wax.lib.harvard.edu
New & noteworthy
Web Continuity Project
• The problem: 60% of the links cited in British Parliamentary debate transcripts dating from 1996-2006 were broken
• The solution: when broken links are found on UK government websites, deliver archived versions
• Sites are crawled 3 times/year → UK Government Web Archive
• When users click on a dead link on the live government site, it automatically redirects to an archived version of that page
* patches the present with the past *
| Web Continuity Project <http://www.nationalarchives.gov.uk/information-management/policies/web-continuity.htm>
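The redirect idea can be sketched in a few lines. The following is a hypothetical illustration using Flask; the archive URL pattern is assumed for illustration, not TNA's actual scheme:

```python
# Minimal sketch of "patching the present with the past":
# on a 404, redirect the user to an archived copy of the same URL.
# The archive URL pattern below is illustrative, not TNA's real scheme.
from flask import Flask, redirect, request

app = Flask(__name__)
ARCHIVE_PREFIX = "http://webarchive.nationalarchives.gov.uk/"  # assumed

@app.errorhandler(404)
def send_to_archive(error):
    # Rebuild the URL the user asked for and bounce to the web archive,
    # which can serve the most recent capture it holds.
    return redirect(ARCHIVE_PREFIX + "*/" + request.url, code=302)
```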
Memento
• The problem: the Web of the past, even where it exists, is very difficult to access compared to the Web of the present
• The solution: leverage Internet protocols and existing stores of past Web resources (Web archives, content management systems, revision control systems) to let a user specify the desired past date of a Web resource
• Give me http://www.ietf.org as it existed around 1997
• LANL, Old Dominion, Harding University; funded by LC’s NDIIPP
* a bridge from the present to the past *
| Memento <http://www.mementoweb.org>
Memento
[Screenshots: viewing the live Web vs. viewing the past Web with Memento]
| Memento <http://www.mementoweb.org>
Memento example
• Using the Memento browser plugin
• User sends a GET/HEAD request to http://www.ietf.org
• A “timegate” is returned
• User sends a request to the timegate requesting http://www.ietf.org around 1997
• A new HTTP request header, “Accept-Datetime”, carries the desired date
• The timegate returns http://web.archive.org/web/19970107171109/http://www.ietf.org/
• A new response header, “Memento-Datetime”, indicates the date the URI was captured
| Memento <http://www.mementoweb.org>
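The same exchange can be sketched with Python's standard library. The TimeGate URI below is assumed for illustration; any Memento-aware archive works the same way:

```python
# Minimal sketch of the Memento exchange: ask a TimeGate for
# http://www.ietf.org as it existed around early 1997.
# The TimeGate URI is illustrative (a Memento aggregator endpoint).
from urllib.request import Request, urlopen

timegate = "http://timetravel.mementoweb.org/timegate/http://www.ietf.org"
req = Request(timegate, headers={
    "Accept-Datetime": "Wed, 01 Jan 1997 00:00:00 GMT",  # desired past date
})
with urlopen(req) as resp:
    # The TimeGate negotiates in the datetime dimension and redirects to a
    # memento; Memento-Datetime says when that capture was made.
    print(resp.geturl())                           # e.g. a web.archive.org URI
    print(resp.headers.get("Memento-Datetime"))
```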
Data mining & analytics
• Extract information/insight from a corpus of data (Web archives)
• Can help researchers answer interesting questions about society, technology & media use, language, …
• This information can enable better UIs for users
• Geospatial maps, tag clouds, classification, facets, rate of change
• Technical platform & tools for research
• Hadoop distributed file system
• MapReduce
• Google Refine
• Pig Latin (scripting)
• IBM BigSheets
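The map/reduce pattern behind tools like Hadoop and Pig can be sketched in plain Python. The toy records below stand in for a real Web archive; a real job would shard both steps across a cluster:

```python
# Minimal sketch of map/reduce over archived captures: map each record to
# (host, 1), then reduce by summing, to see which sites a collection covers.
from collections import Counter
from urllib.parse import urlparse

captures = [  # (URI, capture date) records; illustrative toy data
    ("http://www.harvard.edu/index.html", "2011-06-08"),
    ("http://www.harvard.edu/news.html", "2011-06-08"),
    ("http://www.ietf.org/rfc.html", "2011-06-01"),
]

# map step: emit a (key, value) pair per record
mapped = ((urlparse(uri).netloc, 1) for uri, _date in captures)

# reduce step: sum the values for each key
totals = Counter()
for host, n in mapped:
    totals[host] += n

for host, n in totals.most_common():
    print(host, n)
```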
How did a blogosphere form?
Esther Weltevrede and Anne Helmond
Where do bloggers blog?
Esther Weltevrede and Anne Helmond
Shift from linklogs & lifelogs to platformlogs
Collaborative collections
• End of Term (EOT) collection (2008)
• Before & after the federal government’s public Web presence
• UNT, LC, CDL, IA, GPO
• Winter Olympics 2010
• IA, LC, CDL, BL, UNT, BnF
• EOT and presidential elections (2011-12)
• UNT, LC, CDL, IA, Harvard
• Olympic & Paralympic Games 2012
• BL, ?
| Winter Olympics 2010 <http://webarchives.cdlib.org/a/2010olympics>
| Olympic & Paralympic Games 2012 <http://www.webarchive.org.uk/ukwa/collection/4325386/page/1>
Emulation / KEEP Project
• Problem: how to preserve access to obsolete Web formats
• (One) solution: emulate Web browsers
Related projects:
• Keeping Emulation Environments Portable (KEEP)
– The emulation infrastructure
• Knowledgebase of typical Web client environments by year
– What was typical for a given year?
KEEP Emulation Framework
• User requests digital file in an obsolete format
• The system selects and runs the best available emulator and sets up the software dependencies (OS, apps, plug-ins, drivers)
• The emulators run on a KEEP Virtual Machine (VM) so that only the VM needs to be ported over time, not the emulators
• 9 European institutions, led by the KB; EU-funded
[Diagram: a GUI passes the digital file to the EF core engine, which consults external technical registries and draws on an emulator archive and a SW (software) archive; everything runs on the KEEP Virtual Machine for portability]
| KEEP <http://www.keep-project.eu>
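The selection step can be sketched as a lookup from a file format to an emulation "pathway". Everything in the table below (format IDs, emulators, environments) is illustrative; the real Emulation Framework consults external technical registries instead:

```python
# Minimal sketch of emulator selection: given a file's format, pick an
# emulation pathway (emulator + OS + application) able to render it.
# The pathway table is illustrative, not the EF's actual registry data.
PATHWAYS = {
    # format id: (emulator, operating system, rendering application)
    "old-wordproc": ("Dioscuri", "MS-DOS 6.22", "WordPerfect 5.1"),
    "old-html":     ("QEMU", "Windows 98", "Internet Explorer 4"),
}

def select_pathway(format_id):
    try:
        emulator, os_name, app = PATHWAYS[format_id]
    except KeyError:
        raise LookupError(f"no emulation pathway for {format_id}")
    # In the real framework this would launch the emulator on the KEEP VM.
    print(f"launching {emulator} with {os_name} + {app}")
    return emulator, os_name, app

select_pathway("old-html")
```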
Typical Web environments
• Knowledgebase of typical Web client environments by year
• Web browsers, plug-ins, operating systems, applications
• IIPC PWG
Risk assessment tools
• Knowledgebase of threats to preserving Web content
• Example: viruses
• Related annotated bibliography of risk references
• Knowledgebase of risk mitigation strategies
• Example: virus checking, quarantine at ingest, effective firewalls
• Online tool for institutions to:
• Assess risks
• View other institutions’ risk assessments
• (Future): Analyze risk assessments
• IIPC PWG: Lead: Harvard; Participants: LC, BnF, NLNZ, LAC, KB
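One way such a knowledgebase could be modeled, as a purely hypothetical sketch (the actual tool's schema is not described in this deck):

```python
# Hypothetical sketch of the knowledgebase structure: threats linked to
# mitigation strategies, plus per-institution risk assessments.
from dataclasses import dataclass, field

@dataclass
class Threat:
    name: str                            # e.g. "viruses in harvested content"
    mitigations: list = field(default_factory=list)

@dataclass
class Assessment:
    institution: str
    threat: Threat
    likelihood: str                      # e.g. "low" / "medium" / "high"
    impact: str

viruses = Threat("viruses", ["virus checking", "quarantine at ingest",
                             "effective firewalls"])
print(Assessment("Harvard", viruses, "medium", "high"))
```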
Thank you.
Andrea Goethals
andrea_goethals@harvard.edu | June 23, 2011