Preserving Web-based digital material
Andrea Goethals, Harvard University Library
Why Books? Site Visit, 28 October 2010

Agenda
1. Why preserve Web content?
2. A look at the Web
3. Web archiving
4. Web archiving at Harvard
5. Open challenges in Web archiving
6. Questions?

1. Why preserve Web content?

Books have moved off the shelves and onto the Web!

A few other things on the Web…
TV shows, blogs, images, scholarly papers, stores, discussions, maps, virtual worlds, art exhibits, documents, music, articles, magazines, newspapers, tutorials, software, databases, social networking, advertising, courses, museums, libraries, archives, recipes, data sets, oral history, poetry, broadcasts, wikis, movies…

But is it valuable?
• May be historically significant: White House web site, March 20, 2003
• May be the only version: Harvard Magazine, May/June 2009
• May document human behavior: World of Warcraft, Fizzcrank realm, Morc the Orc's view, Oct. 25, 2010
• Important to researchers: ABC News, Aug. 2007

Important to researchers: a sampling of scholarship on World of Warcraft
• Strangers and friends: collaborative play in World of Warcraft
• From tree house to barracks: The social life of guilds in World of Warcraft
• The life and death of online gaming communities: a look at guilds in World of Warcraft
• Learning conversations in World of Warcraft
• The ideal elf: Identity exploration in World of Warcraft
• Traffic analysis and modeling for World of Warcraft
• E-collaboration and e-commerce in virtual worlds: The potential of Second Life and World of Warcraft
• Understanding social interaction in World of Warcraft
• Communication, coordination, and camaraderie in World of Warcraft
• An online community as a new tribalism: The World of Warcraft
• A hybrid cultural ecology: World of Warcraft in China
• … etc.

More reasons it may be valuable
• May be a work of art: YouTube Play. A Biennial of Creative Video (Oct. 2010–)
• May be important data for scholarship: NOAA Satellite and Information Service
• May be an important reference
• May be of personal value

2. A look at the Web

Remember this? 1993: the "first" graphical Web browser (Mosaic).

Volume of content is immense!
• 1998: the first Google index has 26 million pages
• 2000: the Google index has 1 billion pages
• 2008: Google processes 1 trillion unique URLs
• "… and the number of individual Web pages out there is growing by several billion pages per day" (source: the official Google blog)

Prolific self-publishers
• "Humanity's total digital output currently stands at 8,000,000 petabytes … but is expected to pass 1.2 zettabytes this year. One zettabyte is equal to one million terabytes…"
• "Around 70 per cent of the world's digital content is generated by individuals, but it is stored by companies on content-sharing websites such as Flickr and YouTube."
(Telegraph.co.uk, May 2010, on an IDC study)

Ever-increasing number of web sites
• 96 million of 233 million web sites are active (Netcraft.com)

A moving target
• Flickr (Feb. 2004), Facebook (Feb. 2004), YouTube (Feb. 2005), Twitter (2006)

Anatomy of a web page
Typically 1 web page ≈ 35 files:
• 1 HTML file
• 7 text/css
• 8 image/gif
• 17 image/jpeg
• 2 javascript
(Source: representative samples taken by the Internet Archive)

Can't rely on it always being out there
• Web content is transient: the average lifespan of a web site is between 44 and 100 days
• Captured April 8, 2009; visited October 13, 2010
• Disappearing web sites: for the 2000 Sydney Olympics, most of the Web record is held only by the National Library of Australia
• Half of the URLs cited in D-Lib Magazine were inaccessible 10 years after publication (McCown et al., 2005)
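A minimal sketch of that kind of link-rot check, assuming Python with the third-party requests library and a hypothetical input file, cited_urls.txt, containing one cited URL per line; it simply reports how many of the citations still resolve.

```python
# Minimal link-rot check: report which cited URLs still resolve.
# Assumes the third-party "requests" library and a hypothetical
# input file "cited_urls.txt" with one URL per line.
import requests

def check_urls(path="cited_urls.txt", timeout=10):
    alive, dead = [], []
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        try:
            # HEAD keeps traffic small; some servers reject it, so fall back to GET.
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                resp = requests.get(url, timeout=timeout, allow_redirects=True)
            (alive if resp.status_code < 400 else dead).append(url)
        except requests.RequestException:
            dead.append(url)
    print(f"{len(alive)} of {len(urls)} cited URLs still resolve")
    return alive, dead

if __name__ == "__main__":
    check_urls()
```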
3. Web archiving

Web archiving 101
1. Web harvesting: select and capture Web content (acquisition of web content, alongside the acquisition of other digital content)
2. Preservation of the captured Web content, i.e. "digital preservation": keep it safe, and keep it usable to people long-term despite technological changes (preservation of web content, alongside the preservation of other digital content)

Web harvesting
• Download all files needed to reproduce the Web page
• Try to capture the original form of the Web page as it would have been experienced at the time of capture
• Also collect information about the capture process
• There must be some kind of selection…

Types of harvesting
• Domain harvesting: collect the web space of an entire country, e.g. the French Web, including the .fr domain
• Selective harvesting: collect based on a theme, event, individual, organization, etc., e.g. the London 2012 Olympics, Hurricane Katrina, women's blogs, President Obama
Any type of regular harvesting results in a large quantity of content to manage.

The crawl
1. Pick a location (seed URIs)
2. Make a request to the Web server
3. Receive the response from the Web server
4. Examine the document for URI references
5. Repeat the document exchange for the URIs discovered
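A minimal sketch of this fetch, examine and follow loop, assuming Python with the third-party requests library and a hypothetical seed of https://example.org/. Production crawlers such as Heritrix layer politeness rules, robots.txt handling, scoping and archival output on top of the same basic cycle.

```python
# A toy version of the crawl loop described above. Assumes the
# third-party "requests" library; the seed URI is hypothetical.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

class LinkExtractor(HTMLParser):
    """Collect absolute URIs referenced by href/src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and name in ("href", "src"):
                self.links.add(urljoin(self.base_url, value))

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)                    # URIs waiting to be fetched
    seen = set(seeds)                          # URIs already queued or fetched
    in_scope = {urlparse(s).netloc for s in seeds}
    fetched = 0
    while frontier and fetched < max_pages:
        uri = frontier.popleft()
        try:
            response = requests.get(uri, timeout=10)   # request / response exchange
        except requests.RequestException:
            continue
        fetched += 1
        print(response.status_code, uri)
        if "html" not in response.headers.get("Content-Type", ""):
            continue                           # only HTML is examined for references
        extractor = LinkExtractor(uri)         # examine the document for URI references
        extractor.feed(response.text)
        for link in extractor.links:
            if link not in seen and urlparse(link).netloc in in_scope:
                seen.add(link)
                frontier.append(link)

if __name__ == "__main__":
    crawl(["https://example.org/"])
```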
Web archiving pioneers (mid-1990s)
• Internet Archive / Alexa Internet, with collecting partners
• National libraries of Sweden, Denmark, Australia, Finland and Norway
(Adapted from A. Potter's presentation, IIPC GA 2010)

International Internet Preservation Consortium (IIPC): 2003
• Founding members: the Internet Archive, Library and Archives Canada, the British Library, the Library of Congress, and the national libraries of Sweden, Denmark, France, Norway, Italy, Finland, Australia and Iceland
(IIPC: http://netpreserve.org)

IIPC goals
• Facilitate preservation of a rich body of Internet content from around the world
• Develop common tools, techniques and standards
• Encourage and support Internet archiving and preservation
(IIPC: http://netpreserve.org)

IIPC: 2010
A much larger network of members and collecting partners, including the Internet Archive (with its Archive-It partners), the Library of Congress, the British Library, WAC (UK), TNA (UK), NL of Scotland, Library and Archives Canada, BANQ (Canada), Harvard, GPO (US), CDL (US), UIUC (US), UNT (US), NYU (US), AZ AI Lab (US), OCLC, Hanzo Archives, the European Archive, NL of France / INA, and the national libraries of Israel, Singapore, Japan, Korea, Denmark, Iceland, Finland, Australia, Croatia, Norway, New Zealand, Austria, Spain / Catalunya, Sweden, the Netherlands, Poland, Germany, Slovenia, Italy, Switzerland and the Czech Republic.
(Adapted from A. Potter's presentation, IIPC GA 2010)

Current methods of harvesting
• Contract with another party for crawls (e.g. the Internet Archive's crawls for the Library of Congress)
• Use a hosted service (e.g. the Internet Archive's Archive-It, the California Digital Library's Web Archiving Service (WAS))
• Set up an institution-specific web archiving system (e.g. Harvard's Web Archiving Collection Service (WAX))
Most use IIPC tools such as the Heritrix web crawler.

Current methods of access
• Currently dark, no access (e.g. Norway)
• On-site access for researchers only (e.g. BnF, Finland)
• Public online access (e.g. Harvard, LAC)

What kind of access?
• Most common: browse the site as it was
• Sometimes: full-text search
• Very rare: bulk access for research
• Non-existent: cross-web-archive access
(List of Web archives: http://netpreserve.org/about/archiveList.php)

4. Web archiving at Harvard

Web Archiving Collection Service (WAX)
• Used by "curators" within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content
• Content selection is a local choice
• The content is publicly available to current and future users

WAX workflow
A Harvard unit sets up an account (a one-time event). On an ongoing basis:
1. Curators within that unit specify and schedule content to crawl
2. WAX crawlers capture the content
3. Curators QA the Web harvests
4. Curators organize the Web harvests into collections
5. Curators make the collections discoverable
6. Curators push content to the DRS, where it becomes publicly viewable and searchable

WAX architecture
• Front end: the WAX curator works through the WAXi curator interface; the archive user works through the WAX public interface, with discovery supported by the production index and records in the HOLLIS catalog
• Back end: WAX temporary storage and a temporary index, the back-end services, and the DRS (preservation repository)

Back-end services
• WAX crawlers
• File movers
• Importer
• Deleter
• Archiver
• Indexers

Catalog record
• Minimally at the collection level
• Sometimes also at the Web site level
http://wax.lib.harvard.edu

5. Open challenges in Web archiving

How do we capture…? Streaming media (e.g. videos)
• Non-HTTP protocols (RTMP, etc.), sometimes proprietary
• Experiments to capture video content in parallel to regular crawls (e.g. the BL's One & Other project)
• Complicates play-back as well
• Still experimental, non-scalable and time-consuming

How do we capture…? Highly interactive sites (Flash, AJAX) and "walled gardens"
• Experiments to launch Web browsers that can simulate Web clicks (INA, European Archive)
• Still experimental and time-consuming
• Need help from content hosts

What's next? The Web keeps changing.

How do we do…? Quality assurance (QA)
• Too time-consuming to manually check everything
• Early experiments with automated QA in combination with some manual QA (IA, BL)
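One simple illustration of the kind of automated check that can run before a curator looks at a harvest, as a minimal sketch: it assumes Python and a hypothetical per-harvest report, harvest_report.csv, with uri and status columns (real systems would read the crawler's own logs or archive records), and flags harvests whose error rate exceeds a threshold.

```python
# Flag harvests with an unusually high error rate for manual QA.
# The report format (CSV with "uri" and "status" columns) is hypothetical.
import csv
from collections import Counter

def qa_report(path="harvest_report.csv", error_threshold=0.05):
    statuses = Counter()
    problems = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            status = int(row["status"])
            statuses[status] += 1
            if status >= 400:
                problems.append((status, row["uri"]))
    total = sum(statuses.values())
    error_rate = len(problems) / total if total else 0.0
    print(f"{total} URIs fetched, {len(problems)} errors ({error_rate:.1%})")
    if error_rate > error_threshold:
        print("Harvest flagged for manual QA; sample of failures:")
        for status, uri in problems[:10]:
            print(f"  {status}  {uri}")

if __name__ == "__main__":
    qa_report()
```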
How do we provide access in the future given its…? Complex rendering requirements
• Many formats, dependent on different players and plug-ins
• Potential solutions (IIPC PWG): prioritize format work and tools based on annual format surveys; experiments with migration and emulation; format, browser and software knowledgebases

6. Questions?

How do we preserve it given its…? Separate body of digital content to preserve
• Different infrastructure and staff
• Not actively preserved (the status quo for Web archives)
• Potential solutions: integrate with digital preservation repositories (e.g. Harvard, NLNZ), which leverages existing infrastructure, processes and staff; or keep it separate but consult digital preservation staff

Why is this important?
• "Online video is the fastest growing creative realm on the Internet…"
• "… puts the Guggenheim and YouTube at the forefront of technology and creativity."
• "The Internet is changing the creation and distribution of digital media."
• "With the democratization of production tools and the ability to create works that reach and are shared by millions of people, online video deserves the kind of critical focus this project will bring to bear."
(Excerpts from the YouTube Play FAQ)

How do we capture…? Good representations of Web sites
• Time-consuming and expensive to capture every variation of a site
• Potential solution: scheduled snapshots based on knowledge of a site's meaningful rate of change (e.g. the Harvard University Archives' fall and spring crawls)
• Temporal coherence: the site changes while it is being crawled

Who's responsible?
• Inability to determine who is responsible for preserving which Web content
• A larger problem for collectively produced content
• Potential strategies: collaborative collections (e.g. Hurricane Katrina, the London Olympics); different roles based on institutional expertise (seed selection, crawling, storage/preservation); partnering with major content providers (e.g. the Library of Congress and the Twitter archive)

How do we eliminate…?
• Web spam, and intentional and unintentional crawler traps: potential solutions include spam filters during or after a crawl
• Duplicate content, i.e. exact copies of content previously captured: within a harvest, Heritrix already de-duplicates; among harvests, a "smart crawler" version of Heritrix exists

What should we collect?
• Inability to determine now what will be valuable in the future
• Potential strategies: only do large domain crawls, though there is a price to pay for these crawls (Internet Archive, Swedish National Library, Library and Archives Canada); or selective crawls complemented with periodic broad domain crawls (e.g. BnF, Denmark)

How do we describe it given its…? Volume
• The volume prohibits technical metadata description and storage
• But technical metadata is necessary to know what you have and to plan its preservation
• In practice, limited amounts of metadata are kept (Harvard: formats, admin flags)
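A minimal sketch of collecting coarse format information for a harvest, assuming Python and a hypothetical local directory of captured files named harvest/. It tallies MIME types guessed from file extensions, which is crude compared with the signature-based format identification tools preservation repositories rely on, but shows the kind of lightweight format profile that can be kept even at volume.

```python
# Tally guessed MIME types across a directory of harvested files.
# The "harvest/" path is hypothetical; extension-based guessing is a
# rough stand-in for signature-based format identification.
import mimetypes
import os
from collections import Counter

def format_profile(harvest_dir="harvest/"):
    counts = Counter()
    for root, _dirs, files in os.walk(harvest_dir):
        for name in files:
            mime, _encoding = mimetypes.guess_type(os.path.join(root, name))
            counts[mime or "unknown"] += 1
    for mime, n in counts.most_common():
        print(f"{n:8d}  {mime}")
    return counts

if __name__ == "__main__":
    format_profile()
```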