Preserving Web Content: Harvard & Worldwide
Andrea Goethals | andrea_goethals@harvard.edu | June 23, 2011

Agenda
PART 1: The Web
PART 2: Web Archiving Today
PART 3: Web Archiving at Harvard
PART 4: New & Noteworthy

The Web

1993: the "1st" graphical Web browser, Mosaic
[Image: NCSA Mosaic] | UIUC NCSA <ftp://ftp.ncsa.uiuc.edu/Web/Mosaic/>

"We knew the web was big…"
• 1998: first Google index – 26 million pages
• 2000: Google index – 1 billion pages
• 2008: Google's link processors – 1 trillion unique URIs
  – "… and the number of individual Web pages out there is growing by several billion pages per day"
| Official Google Blog, 7/25/2008 <http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html>

"Are You Ready?"
• 2009: the digital universe estimated at 0.8 zettabytes (1 ZB = 1 billion terabytes)
• 2010: estimated at 1.2 ZB
• 2020: estimated to grow by a factor of 44 over 2009, to 35 ZB
| 2010 Digital Universe Study, IDC <http://gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf> | The Telegraph, May 4, 2010 <http://www.telegraph.co.uk/technology/news/7675214/Digital-universe-to-smash-zettabyte-barrier-for-first-time.html>

Outpacing storage
[Chart: information created vs. available storage] | 2010 Digital Universe Study, IDC <http://gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf>

Some of this content is worth keeping
Citizen journalism, hobby sites, religious sites, news sites, courses, sports sites, personal and professional blogs, social networks, social causes, networked games, the history of major events, references, humor, podcasts, scholarly papers, maps, TV shows, documents, music, discussion groups, poetry, personal sites, magazines, books, multimedia art, organizational websites, photography, citizens' groups, performances, research results, scientific knowledge, political websites, government websites, Internet history, comics, virtual worlds, alternative magazines, fashion, movies, art exhibits

Many believe Web content is permanent
| "Digital Natives Explore Digital Preservation", Library of Congress <http://www.youtube.com/watch?v=6fhu7s0AfmM>

Ever seen this? 404 Not Found

Yahoo! GeoCities (1994-2009)

Web 2.0 Companies: 2006
By Ludwig Gatzke | Posted to flickr on Feb. 19, 2006 <http://www.flickr.com/photos/stabilo-boss/93136022/in/set-72057594060779001/>

Web 2.0 Companies: 2009
By Meg Pickard | Posted to flickr on May 16, 2009 <http://www.flickr.com/photos/meg/3528372602/>

Mosaic Hotlist (1995)
• 59 URIs
  – 50 HTTP URIs: 36 dead, 13 live, 1 reused
  – 9 Gopher URIs: all dead

A fleeting document of our time period
"Every site on the net has its own unique characteristic and if we forget about them then no one in the future will have any idea about what we did and what was important to us in the 21st century."
| "America's Young Archivists", Library of Congress <http://www.youtube.com/watch?v=Gob7cjzoX3Y>

Web archiving today

Web archiving 101
• Web harvesting: select and capture Web content
• Preservation of captured Web content ("digital preservation")
  – Keep it safe
  – Keep it usable to people long-term, despite technological changes
[Diagram: acquisition and preservation each cover both Web content and other digital content]

Anatomy of a Web page
Typically, 1 Web page = ~35 files:
• 17 JPEG images, 8 GIF images, 7 CSS files, 2 JavaScript files, 1 HTML file
(Source: representative samples taken by the Internet Archive)
www.harvard.edu (6/8/2011): 58 files
• 19 PNG images, 13 JavaScript files, 12 GIF images, 10 JPEG images, 3 CSS files, 1 HTML file
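As a rough illustration of where those counts come from, here is a minimal Python sketch (mine, not the presenter's) that tallies the resource references in one page's HTML. The tag-to-category mapping is a simplifying assumption, and resources loaded dynamically by JavaScript or from within CSS are missed.

```python
# Minimal sketch: tally the extra files one web page references.
# Standard library only; the seed URL is the deck's own example.
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen

class ResourceCounter(HTMLParser):
    """Counts tags that typically pull in extra files (images, CSS, JS)."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.counts["image"] += 1
        elif tag == "script" and "src" in attrs:
            self.counts["javascript"] += 1
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.counts["css"] += 1

html = urlopen("http://www.harvard.edu/").read().decode("utf-8", errors="replace")
parser = ResourceCounter()
parser.feed(html)
print(parser.counts)  # e.g. Counter({'image': ..., 'javascript': ..., 'css': ...})
```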
Web harvesting
• Download all the files needed to reproduce a Web page
• Try to capture the original form of the page as it would have been experienced at the time of capture
• Also collect information about the capture process itself
• There must be some kind of content selection…

Types of harvesting
• Domain harvesting – collect the Web space of an entire country
  – e.g., the French Web, including the .fr domain
• Selective harvesting – collect based on a theme, event, individual, organization, etc.
  – e.g., the London 2012 Olympics, Hurricane Katrina, women's blogs, President Obama
  – Planned vs. event-based
Any type of regular harvesting results in a large quantity of content to manage.

The crawl
The crawler's document exchange is a loop, sketched in code below:
1. Pick a location (seed URIs)
2. Make a request to the Web server
3. Receive the response from the Web server
4. Examine the response for URI references, queue them, and repeat
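A minimal sketch of that loop, using only standard-library Python and an illustrative seed URI. A production crawler such as Heritrix adds robots.txt handling, politeness delays, de-duplication, and archival (WARC) output on top of exactly this skeleton.

```python
# Minimal breadth-first crawl: seed URIs -> fetch -> extract links -> repeat.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seeds, limit=25):
    frontier, seen = deque(seeds), set(seeds)
    while frontier and len(seen) <= limit:
        uri = frontier.popleft()
        try:
            body = urlopen(uri, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead or unreachable link; a real archive would log it
        # ... a real crawler would write the full response to a WARC file here ...
        extractor = LinkExtractor()
        extractor.feed(body)
        for href in extractor.links:
            absolute = urljoin(uri, href)  # resolve relative references
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

print(crawl(["http://example.org/"]))  # illustrative seed
```

The frontier queue plus a seen-set is the core of any crawler; everything beyond it (scope rules, scheduling, storage) is policy.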
Web archiving pioneers: mid-1990s
Internet Archive (with Alexa Internet and collecting partners), NL of Sweden, NL of Denmark, NL of Australia, NL of Finland, NL of Norway
| Adapted from A. Potter's presentation, IIPC GA 2010

International Internet Preservation Consortium (IIPC): 2003
Founding members: Internet Archive, Library and Archives Canada, NL of Sweden, NL of Denmark, NL of France, British Library, NL of Norway, Library of Congress, NL of Italy, NL of Finland, NL of Australia, NL of Iceland
| IIPC <http://netpreserve.org>

IIPC goals
• Enable collection, preservation and long-term access of a rich body of Internet content from around the world
• Foster development and use of common tools, techniques and standards
• Be a strong advocate for initiatives and legislation that encourage the collection, preservation and long-term access to Internet content
• Encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation
| IIPC <http://netpreserve.org>

IIPC: 2011
Members now include: NL of China, NL of Singapore, BAnQ (Canada), Library and Archives Canada, Bibliotheca Alexandrina, WAC (UK), NL of Israel, NL of Japan, NL of Korea, Hanzo Archives, TNA (UK), British Library (UK), Harvard Library, UNT (US), NYU (US), Library of Congress, CDL (US), OCLC, AZ AI Lab (US), Internet Memory Foundation, NL of Iceland, NL of Finland, NL of Australia, NL of Croatia, NL of Norway, NL of NZ, NL of Austria, NL of Spain / Catalunya, NL of France / INA, UIUC (US), NL of Denmark, Internet Archive (with its Archive-It partners), GPO (US), NL of Sweden, NL of Scotland, NL of Netherlands / VKS, NL of Poland, NL of Germany, NL of Slovenia, NL of Italy, NL of Switzerland, NL of Czech Republic – plus each member's collecting partners
| IIPC <http://netpreserve.org>

Current methods of harvesting
• Contract with another party for crawls
  – e.g., the Internet Archive's crawls for the Library of Congress
• Use a hosted service
  – Archive-It (provided by the Internet Archive)
  – Web Archiving Service (WAS) (provided by the California Digital Library)
• Set up an institutional Web archiving system
  – e.g., Harvard's Web Archiving Collection Service (WAX)
  – Most use IIPC tools like the Heritrix Web crawler

Current methods of access
• Currently dark – no access (Norway, Slovenia)
• On-site only (BnF, Finland, Austria)
• Public online access (Harvard, LAC, some LC collections)
• What kind of access?
  – Most common: browse the site as it was, plus URL search
  – Sometimes: also full-text search
  – Very rare: bulk access for research
  – Nonexistent: cross-institutional Web archive discovery/access

Current big challenges
• Legal – high-value content is locked up in gated communities (e.g., Facebook); who owns what?
• Technical – the Web keeps morphing, so our capture tools must too; big data requires very scalable infrastructure (indexing, de-duplication, format identification, …)
• Organizational – Web archiving is very resource-intensive and competes with other institutional priorities
• Preservation – many different formats; complex, interconnected content; high-maintenance rendering requirements

Web archiving at Harvard

Web Archiving Collection Service (WAX)
• Used by "curators" within Harvard units (departments, libraries, museums, etc.) to collect and preserve Web content
• Content selection is a local choice
• The system is managed centrally by OIS
• The content is publicly available to current and future users
• The content is preserved in the Digital Repository Service (DRS), managed by OIS

WAX workflow
• A Harvard unit sets up an account (a one-time event)
• On an ongoing basis:
  – Curators within that unit specify and schedule content to crawl
  – WAX crawlers capture the content
  – Curators QA the resulting Web harvests
  – Curators organize the Web harvests into collections
  – Curators make the collections discoverable
  – Curators push the content to the DRS, where it becomes publicly viewable and searchable

[Architecture diagram, shown stepwise across several slides: front end – WAX curator, WAXi curator interface, WAX temp storage, temp index, back-end services; back end – HOLLIS catalog, archive user, production index, WAX public interface, DRS (preservation repository)]

Back-end services
• WAX crawlers
• File movers
• Importer
• Deleter
• Archiver
• Indexers

Catalog record
• Minimally at the collection level
• Sometimes also at the Web site level

http://wax.lib.harvard.edu

New & noteworthy

Web Continuity Project
• The problem: 60% of the links cited in British Parliamentary debate transcripts dating from 1996-2006 were broken
• The solution: when broken links are found on UK government websites, deliver archived versions, as sketched below
• Sites are crawled 3 times/year → the UK Government Web Archive
• When a user clicks a dead link on the live government site, it automatically redirects to an archived version of that page
* patches the present with the past *
| Web Continuity Project <http://www.nationalarchives.gov.uk/information-management/policies/web-continuity.htm>
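The redirect idea fits in a few lines. This is my illustration only, not The National Archives' actual implementation: the archive URL pattern and the example domain are assumptions.

```python
# Sketch of "patch the present with the past": on a 404, redirect the
# user to an archived copy instead of failing. Illustrative only; the
# real UK Government Web Archive URL scheme may differ.
from urllib.parse import quote

ARCHIVE_PREFIX = "http://webarchive.nationalarchives.gov.uk/"  # assumed pattern

def handle_request(path, page_exists):
    """Return (status, location) for a request against the live site."""
    if page_exists(path):
        return 200, path  # page still exists: serve it live
    # Page is gone: send the user to the archived copy of the same URI.
    archived = ARCHIVE_PREFIX + "*/" + quote("http://www.example.gov.uk" + path)
    return 302, archived

status, location = handle_request("/old-policy.html", lambda p: False)
print(status, location)  # 302 -> archived copy
```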
Memento
• The problem: the Web of the past, even where it exists, is very difficult to access compared to the Web of the present
• The solution: leverage Internet protocols and existing stores of past Web resources (Web archives, content management systems, revision control systems) to let a user specify the desired past date of a Web resource
  – "Give me http://www.ietf.org as it existed around 1997"
• LANL, Old Dominion, Harding University; funded by LC's NDIIPP
* a bridge from the present to the past *
| Memento <http://www.mementoweb.org>

Memento
[Screenshots: viewing the live Web vs. viewing the past Web] | Memento <http://www.mementoweb.org>

Memento example
Using the Memento browser plugin:
• The user sends a GET/HEAD request to http://www.ietf.org
• A "timegate" is returned
• The user sends a request to the timegate asking for http://www.ietf.org around 1997, carried in a new HTTP request header, "Accept-Datetime"
• The timegate returns http://web.archive.org/web/19970107171109/http://www.ietf.org/, with a new response header, "Memento-Datetime", indicating the date the URI was captured
| Memento <http://www.mementoweb.org>
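The same exchange can be scripted. In the sketch below, the timegate URL is an assumption based on the Wayback Machine's Memento support; the two headers are the ones the slide names.

```python
# Sketch of the Memento exchange: ask a timegate for http://www.ietf.org
# as it existed around January 1997. The timegate URL is assumed; other
# archives and aggregators expose their own timegates.
import urllib.request

TIMEGATE = "http://web.archive.org/web/http://www.ietf.org"

req = urllib.request.Request(TIMEGATE, method="HEAD")
# New request header defined by Memento: the point in time we want.
req.add_header("Accept-Datetime", "Tue, 07 Jan 1997 00:00:00 GMT")

resp = urllib.request.urlopen(req)  # follows the timegate's redirect
print(resp.url)                              # URI of the selected memento
print(resp.headers.get("Memento-Datetime"))  # when that copy was captured
```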
Data mining & analytics
• Extract information and insight from a corpus of data (Web archives)
• Can help researchers answer interesting questions about society, technology and media use, language, …
• This information can also enable better UIs for users
  – Geospatial maps, tag clouds, classification, facets, rate of change
• Technical platform and tools for research
  – Hadoop distributed file system
  – MapReduce
  – Google Refine
  – Pig Latin (scripting)
  – IBM BigSheets

How did a blogosphere form? | Esther Weltevrede and Anne Helmond

Where do bloggers blog? | Esther Weltevrede and Anne Helmond

Shift from linklogs & lifelogs to platformlogs

Collaborative collections
• End of Term (EOT) collection (2008)
  – Before & after snapshots of the federal government's public Web presence
  – UNT, LC, CDL, IA, GPO
• Winter Olympics 2010
  – IA, LC, CDL, BL, UNT, BnF
• EOT and presidential elections (2011-12)
  – UNT, LC, CDL, IA, Harvard
• Olympic & Paralympic Games 2012
  – BL, ?
| Winter Olympics 2010 <http://webarchives.cdlib.org/a/2010olympics> | Olympic & Paralympic Games 2012 <http://www.webarchive.org.uk/ukwa/collection/4325386/page/1>

Emulation / KEEP Project
• Problem: how to preserve access to obsolete Web formats
• (One) solution: emulate Web browsers
Related projects:
• Keeping Emulation Environments Portable (KEEP) – the emulation infrastructure
• Knowledgebase of typical Web client environments by year – what was typical for a given year?

KEEP Emulation Framework
• The user requests a digital file in an obsolete format
• The system selects and runs the best available emulator and sets up the software dependencies (OS, applications, plug-ins, drivers)
• The emulators run on a KEEP Virtual Machine (VM), so that only the VM needs to be ported over time, not the emulators
• 9 European institutions, led by the KB; EU-funded
[Diagram: GUI → EF core engine, drawing on external technical registries, a digital file archive, an emulator archive and a software archive, all running on the KEEP Virtual Machine (portability)]
| KEEP <http://www.keep-project.eu>

Typical Web environments
• Knowledgebase of typical Web client environments by year
  – Web browsers, plug-ins, operating systems, applications
• IIPC PWG

Risk assessment tools
• Knowledgebase of threats to preserving Web content
  – Example: viruses
• Related annotated bibliography of risk references
• Knowledgebase of risk mitigation strategies
  – Example: virus checking, quarantine at ingest, effective firewalls
• Online tool for institutions to:
  – Assess risks
  – View other institutions' risk assessments
  – (Future) analyze risk assessments
• IIPC PWG – lead: Harvard; participants: LC, BnF, NLNZ, LAC, KB

Thank you.
Andrea Goethals
andrea_goethals@harvard.edu | June 23, 2011