Problems and Issues in Selecting, Harvesting, and Cataloging Web Resources
Joanne Archer

University of Maryland Libraries
Web Harvesting Jargon
• Crawler
• Seed
• Crawl
• Harvest
• Wayback Machine
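The slides list these terms without defining them. As the terms are commonly used, a seed is a starting URL, a crawler is the program that follows links outward from it, a crawl is one run of that program, and the harvest is the set of pages captured. The sketch below illustrates that relationship on a small in-memory link graph; the URLs, the `PAGES` table, and the `crawl` function are hypothetical and are not taken from any harvesting tool named in this presentation.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
PAGES = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://other.example/x"],
    "http://example.org/b": [],
    "http://other.example/x": [],
}


def crawl(seed, pages, prefix):
    """Breadth-first crawl from a seed, keeping only URLs under the given prefix."""
    harvested = []
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        harvested.append(url)  # "capture" the page
        for link in pages.get(url, []):
            # Follow a link only if it is in scope and not already queued.
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return harvested


harvest = crawl("http://example.org/", PAGES, "http://example.org/")
print(harvest)
```

Here the link to `other.example` is discovered but never captured, because it falls outside the prefix derived from the seed.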
Options for Web Harvesting
• In-House Program (e.g. Pandora, Web Curator Tool)
  Pro: flexibility
  Con: $$$
• Off-the-Shelf Software (e.g. HTTrack, Adobe Web Capture)
  Pro: inexpensive
  Con: not scalable
• Third-Party Subscription (e.g. Web Archiving Service, Archive-It)
  Pro: ease of use
  Con: $
Key Questions for Harvesting Projects
scope
Maryland’s Pilot Harvests (2008-2010)
• Historic Preservation
• Maryland State Documents
Why harvest these areas?
• Builds on existing strengths in print collections
• Collections are unique
• Large amount of material migrating to the web
Key Questions for Harvesting Projects
scope
Harvesting
Harvesting Challenges:
• JavaScript
• Streaming media
• Form- and database-driven content
• Password-protected sites
• robots.txt files
• Multiple hosts/subdomains
Single host = www.preservemd.org
Multiple hosts = www.umd.edu, www.lib.umd.edu
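Two of the challenges above, robots.txt files and multiple hosts, come down to simple checks a crawler makes before fetching a URL. The following is a minimal sketch using only the Python standard library, not the logic of any particular harvesting service. The function names, the `example-harvester` user agent, and the robots.txt rules are assumptions made for illustration; only the hostnames come from the slide.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def in_scope(url, seed_hosts):
    """Return True if the URL's host is one of the hosts named in the seeds."""
    host = urlsplit(url).hostname or ""
    return host.lower() in seed_hosts


def allowed_by_robots(url, robots_txt, user_agent="example-harvester"):
    """Check a URL against robots.txt rules supplied as text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Single host: every page lives under one hostname.
single = {"www.preservemd.org"}
# Multiple hosts: one institution spreads content across subdomains.
multiple = {"www.umd.edu", "www.lib.umd.edu"}

print(in_scope("http://www.lib.umd.edu/special/", single))
print(in_scope("http://www.lib.umd.edu/special/", multiple))

# Hypothetical robots.txt content, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots("http://www.umd.edu/private/report.pdf", rules))
print(allowed_by_robots("http://www.umd.edu/news/", rules))
```

With a single-host seed list, a page on a subdomain such as www.lib.umd.edu is treated as out of scope, so each host has to be named explicitly if its content is meant to be captured.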
End-User Access
• General material designation
• Collection note
• URLs
• Subject heading
• Uniform title
Conclusions
Challenges
• Start-up costs
• What to collect
• Metadata creation
BUT
We are well prepared to meet the challenges