Problems and Issues in Selecting, Harvesting, and Cataloging Web Resources
Joanne Archer
University of Maryland Libraries

Web Harvesting Jargon
• Crawler
• Seed
• Crawl
• Harvest
• Wayback Machine

Options for Web Harvesting

In House Program
e.g. Pandora, Web Curator Tool
Pro: flexibility
Con: $$$

Third Party Subscription
e.g. Archive-It, Web Archiving Service
Pro: ease of use
Con: $

Off the Shelf Software
e.g. HTTrack, Adobe Web Capture
Pro: inexpensive
Con: not scalable

Key Questions for Harvesting Projects
• Scope

Maryland's Pilot Harvests (2008-2010)
• Historic Preservation
• Maryland State Documents

Why harvest these areas?
• Builds on existing strengths in print collections
• Collections are unique
• Large amount of material migrating to the web

Key Questions for Harvesting Projects
• Scope
• Harvesting

Harvesting Challenges:
• JavaScript
• Streaming media
• Form and database driven content
• Password protected sites
• robots.txt files
• Multiple hosts/subdomains

Single host = www.preservemd.org
Multiple hosts = www.umd.edu, www.lib.umd.edu

End-User Access
Catalog record elements:
• General material designation
• Collection note
• URLs
• Subject heading
• Uniform title

Conclusions

Challenges
• Start up costs
• What to collect
• Metadata creation

BUT: We are well prepared to meet the challenges
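
Illustration: single host versus multiple hosts

The "Multiple hosts/subdomains" challenge comes down to deciding which hostnames count as part of a harvest. The sketch below shows one way to express that as a scope check, using the hostnames from the slides. The function name in_scope, the two set names, and the URL paths are hypothetical, chosen for illustration only; this is not how any particular harvesting tool implements scoping.

```python
from urllib.parse import urlsplit

# Hostnames taken from the slides; the set names are hypothetical.
SINGLE_HOST = {"www.preservemd.org"}
MULTIPLE_HOSTS = {"www.umd.edu", "www.lib.umd.edu"}


def in_scope(url, allowed_hosts):
    """Return True if the URL's hostname is one of the allowed hosts."""
    host = (urlsplit(url).hostname or "").lower()
    return host in allowed_hosts


# Example URLs with made-up paths, for illustration only.
print(in_scope("http://www.preservemd.org/about", SINGLE_HOST))
print(in_scope("http://www.lib.umd.edu/special", SINGLE_HOST))
print(in_scope("http://www.lib.umd.edu/special", MULTIPLE_HOSTS))
```

With a single-host scope, a link that leads to a different hostname is left out of the harvest, so a site spread across several hosts or subdomains needs each of them listed explicitly.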