metadata considerations for digital libraries
© Tefko Saracevic, Rutgers University

the Web
• fastest growing technology in history
• explosive growth of WWW provided
  – ubiquity of information and access
  – but also information chaos & anarchy
• growing difficulty in identifying, searching & retrieving
• ‘lost in an ocean’ metaphors

problem
• to organize & search the Web needed: knowledge about the structure of data
  – but Web data & databases fuzzy
  – structures vary widely; no consistency
  – constantly evolve over time
  – lack of agreement about meaning of even simple terms & concepts in structure

solution
• some standardized description or language to increase functionality
  – a mechanism for a more precise description of things on the Web
• going from machine-readable to machine-understandable
  – missing in original Web architecture

METADATA !

metadata

what?
• metadata: ‘data about data’
  – machine understandable information for the Web - emphasis on machine
  – description of what a text (or any object) part is all about
    • e.g. labeling title, author, source …
• many evolving standards suggested to be applied in various domains

where?
• in volatile digital environments
  – metadata describe electronic resources, texts & multimedia
  – metadata exist or have meaning only in relation to the referenced document or object
    • provide information about the object

why?
• to standardize description of what is what in electronic resources in order
  • to aid in identification, organization, & location of a great variety
  • to enable effective search of variety of objects (documents) distributed all over
  • sometimes also to provide controls (e.g. validation, rights, provenance, ratings ...)
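The idea of metadata as ‘data about data’ (labeling title, author, source so that a machine can identify and locate objects) can be illustrated with a minimal sketch. The class `MetadataRecord`, the function `find_by_field`, and the sample values are all hypothetical and invented for illustration; they do not follow any particular metadata standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical record: data about an object, not the object itself."""
    identifier: str  # points at the referenced document or object
    title: str
    author: str
    source: str


def find_by_field(records, field_name, value):
    """Return the records whose named field equals value, ignoring case."""
    wanted = value.casefold()
    return [r for r in records if getattr(r, field_name).casefold() == wanted]


# Invented sample data; the records only have meaning in relation to obj-1, obj-2.
records = [
    MetadataRecord("obj-1", "Sample Report", "A. Author", "Example Archive"),
    MetadataRecord("obj-2", "Sample Map", "B. Author", "Example Archive"),
]

by_author = find_by_field(records, "author", "a. author")
print([r.identifier for r in by_author])
```

Because every record labels the same fields in the same way, a search can ask for "author" directly rather than guessing where an author name might sit in free text. That is the practical payoff of a standardized description.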

importance
• standard metadata descriptions are a prerequisite to
  – common use
  – effective searching
  – ‘intelligent’ roaming by agents
  – validation, ratings

markup languages
• SGML - granddaddy (standard in 1986)
  – marks elements within documents
    • derived from old markups for typesetting
    • adapted by communities producing electronic documents
• machine independent - reason for success
  – transportable from one hardware & software to another; substitutes strings
• many extensions & specific applications

principles
• ALL markup languages must specify
  • what markup means
  • what markup is allowed
  • what markup is required
  • how markup is distinguished from text
• all markup languages & applications follow these principles
• underlying concepts are fairly simple but they get very confusing real fast

specifications
• types of documents defined by DTDs (Document Type Definitions)
  – many types & applications formulated
    • vary greatly in complexity and use
• RDF - Resource Description Framework
  – a common syntax, data model & scheme for describing

extensions
• HTML - most famous & successful
  – allows for metatags in the Head
    • not used much, even discouraged
    • in the body could be indirect
• XML - the next big thing (hopefully)
  • data format for structured document interchange & interoperability on WWW
  • increases functionality of SGML & combines with ease of use of HTML

who specifies standards?
• formal groups
  – national & international standards organizations - ISO, ANSI, NISO
• informal groups
  – WWW Consortium (W3C)
  – Dublin Core
  – Library of Congress

proliferation
• currently: proliferation of metadata standards activities - many domains
  – a lot of confusion & incompatibility
  – in document description & libraries
• coordination through liaisons & a number of projects in the U.S. & internationally
  – strength: domain experts involvement
  – weakness: limited perspective; re-invention

libraries
• in libraries metadata has a very long tradition, long preceding the Web (but not called metadata)
  – cataloging rules, standards
    • MARC (Machine Readable Cataloging)
    • enabled worldwide exchange of cataloging records
    • but long standing problems with searching

sample of projects
• Encoded Archival Description (EAD)
• Text Encoding Initiative (TEI)
• Federal Geographic Data Committee (FGDC) - geospatial data
• Z39.50 standards - searching
• crosswalks: mapping e.g. DC to MARC

Dublin Core (DC)
• international initiative to define a core set of elements for describing Web resources
  – a set of 15 elements
    Title; Creator; Subject; Description; Publisher; Contributor; Date; Type; Format; Identifier; Source; Language; Relation; Coverage; Rights
• wide interest & a lot of work but not widely applied on the Web

library interoperability
• library catalogs bound by proprietary software & hardware
• middleware needed
  – protocols (based on Z39.50) provide for interaction of clients with many servers (catalogs)
• problems remain with semantic interoperability

digitization
• metadata assignment (cataloging) a key component in digitization or electronic publishing
• choices: a spectrum of possibilities to select & apply metadata
• search for automation - e.g. templates
• connection with cataloging, indexing

decisions, decisions
  – how & what to plan for metadata creation in conjunction with a digital library?
  – target audience?
  – scope and depth?
  – what to adopt? plug into a scheme?
  – how to integrate metadata projects?
  – needed skills? training? staffing?

$$$$
• costs of metadata: HUGE
  – involved operations
  – time, personnel, effort
  – learning many new things included
  – making decisions complex & involved
• cooperative activities essential
• libraries pushed out of libraries
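The crosswalk idea listed under sample of projects (mapping e.g. DC to MARC) can be sketched as a lookup table. This is a minimal illustration under stated assumptions, not an authoritative crosswalk. The table `DC_TO_MARC` covers only four of the 15 Dublin Core elements, its MARC tag and subfield choices are assumptions based on common practice, and the function `crosswalk` and the sample record are invented.

```python
# Partial, illustrative crosswalk: Dublin Core element -> (MARC tag, subfield).
# The tag/subfield choices are assumptions, not an official mapping.
DC_TO_MARC = {
    "title": ("245", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
}


def crosswalk(dc_record):
    """Map a dict of DC elements to (tag, subfield, value) triples.

    Elements with no entry in the table are reported separately rather
    than dropped silently, so the loss in the mapping stays visible.
    """
    mapped, unmapped = [], []
    for element, value in dc_record.items():
        target = DC_TO_MARC.get(element.lower())
        if target is None:
            unmapped.append(element)
        else:
            tag, subfield = target
            mapped.append((tag, subfield, value))
    return mapped, unmapped


# Invented sample record.
dc_record = {
    "Title": "Sample Report",
    "Publisher": "Example Press",
    "Coverage": "Example Region",
}
mapped, unmapped = crosswalk(dc_record)
for tag, subfield, value in mapped:
    print(f"{tag} ${subfield} {value}")
print("unmapped:", unmapped)
```

The unmapped list hints at why semantic interoperability remains a problem. Even when the syntax converts cleanly, elements without an agreed counterpart need a human decision.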