An Arizona Model for Capturing and Describing Documents on the Web

Richard Pearce-Moses
Director of Digital Government Information
Arizona State Library, Archives and Public Records
rpm at lib.az.us

What Does WWW Stand For?
[Collage of Robert Conrad as James West in the Wild, Wild West removed to avoid violation of copyright.]
They both abbreviate to WWW
Rugged Individualism
Lack of standards ~ Lawlessness

The Dream
To collect, manage, preserve, and make useful the enormous amount of digital information our culture is now producing

The Reality
Two Approaches
Bibliocentric (Item-by-Item)
Tech-centric (Capture-It-All)
Emphasis on Software Tools and Technology
Limited Assistance from Content Providers

Library of Congress & NDIIPP
University of Illinois at Urbana-Champaign School of Library and Information Science
OCLC
Content Providers
Tufts University Perseus Project • Michigan State University Library • State libraries: Arizona, Connecticut, Illinois, North Carolina, Wisconsin • UIUC partners: NCSA • WILL AM/FM/TV • Information Management Services

Digital Archives
Libraries
Artificial collections • Item Level Control
Archives
Provenance • Original Order • Hierarchy • Aggregate Control

Websites as Archival Collections
Documents of Common Provenance
Organized into Directories (Archival Series)
Publications v. Records

The Art and Craft of Building a Collection
What we do remains the same
How we do it will change
Identification/Selection
Acquisition
Description
Reference
Preservation

Identification: Where Do We Look?
Finding the Forest
az.gov • state.az.us

Domain Tool
Identifies all distinct domains
Reports new sites since previous spider
Reports when sites disappear

Selection: Which Collections Do We Harvest?
Collection-Level Analysis
Macro appraisal sets priorities
Materials appraised as series

Content Providers Taxonomy Tool
Names • Administrative history
Relationships • Subjects • Functions

Selection: Which Documents Do We Harvest?
Identify Series
Aggregate selection
Set frequency of harvests

Site Analysis Tool
Display structure
Harmonize physical, intellectual structure
Identify inaccessible content
Show what’s new
Show significant changes

Description
To be able to locate documents
• when the creator or provenance is known
• when the subject is known
• and to aid in selection as to character

Series Description
• Make directory name a meaningful title
• Scope and contents note
• High-level subject headings
• Recorded in site analysis tool database

Document Description
• Creator: taxonomy, internal metadata
• Title: from internal metadata, noun phrases
• Subject: from series metadata, internal metadata

Access
Finding Aids
A valuable bird’s-eye view for archivists
Of limited value to patrons . . .
Unless they’re transformed into topic maps

Full Text Search Engines
Ranking Algorithms

Categorization / Packaging Results
Based on series-level metadata
Based on autoclassification

Description and Access
Series-Level Description
name=“Creator” Governor’s Drought Task Force Rural Watershed Alliance
name=“Subject” reservoirs ground water
name=“Subject” drought water conservation
name=“Subject” potable water agriculture
name=“Type” planning reports

Categorized Results
Your search for water, Phoenix

Found documents in the following categories
water (500+)
water conservation (357)
drought (110)
flood control (98)

Found documents from the following agencies
Water Resources (135)
Governor's Drought Task Force (102)
Maricopa County (84)
Corporation Commission (35)
Salt River Project (210)

xeriscape (25)
Phoenix (87)

Administration / Curation / Stewardship
Systematic
Regular Workflows
Not idiosyncratic

Collaborative
Consensual, Not Idiosyncratic
Avoid Redundant Efforts

Quality Control
Need for Good Metrics
Need for Regular Audits

Stay Tuned . . . .
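
Supplementary sketch: Domain Tool comparison
The Domain Tool slide lists three behaviors: identifying all distinct domains, reporting new sites since the previous spider, and reporting when sites disappear. The Python below is only a minimal sketch of that comparison, assuming each spider run produces a plain list of URLs. The function names and the sample URLs are hypothetical and are not taken from the Arizona tool itself.

```python
from urllib.parse import urlsplit


def distinct_domains(urls):
    """Return the set of distinct hostnames found in a list of URLs."""
    domains = set()
    for url in urls:
        # hostname is lowercased by urlsplit and is None when no host is present
        host = urlsplit(url).hostname
        if host:
            domains.add(host)
    return domains


def compare_spiders(previous_urls, current_urls):
    """Report domains that appeared or disappeared between two spider runs."""
    previous = distinct_domains(previous_urls)
    current = distinct_domains(current_urls)
    return {
        "new": sorted(current - previous),
        "disappeared": sorted(previous - current),
    }


# Hypothetical URLs standing in for two consecutive spider runs.
previous_run = [
    "http://www.az.gov/index.html",
    "http://agency-one.example.az.gov/reports/plan.pdf",
]
current_run = [
    "http://www.az.gov/index.html",
    "http://agency-two.example.az.gov/about/",
]

report = compare_spiders(previous_run, current_run)
print("New sites:", report["new"])
print("Disappeared sites:", report["disappeared"])
```

Comparing sets of hostnames keeps the report at the level of whole sites rather than individual documents, which matches the slide's "finding the forest" framing; a production tool would presumably also need to store each run so that the previous spider is available for comparison.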